

Data Mining and OLAP

Stages of Data Mining and OLAP (with thanks to Janet Francis)


Aims

• This lecture aims to cover
– The nature of data mining
– Stages of Data Mining
– OLAP


What is Data Mining

• The term Data Mining is used because mining for valuable data in a large database is similar to mining for a valuable ore in a huge mountain.
– In a mining operation large amounts of low grade materials are sifted through in order to find something of value.
– In its computing counterpart large volumes of data are searched in an attempt to find something worthwhile.


A useful Scenario

• The following scenario will be used in this lecture in order to make the processes seem more relevant.
– Beaconside JAMS PLC (BJP) supplies jams/cake fillings/sauces to confectioners and bakers. Customers include large multinationals and small specialist outlets.
– BJP has centres in England, Scotland and Spain
– BJP is not a manufacturing organisation – it is a retailer, which means that it buys from a distributor and sells on to customers.


Human vs Data Mining

• Human – Usually takes the form of hypothesis verification

– The analyst has a theory, for example: we sell more high margin goods pro rata to specialist outlets than to multinationals, and specialist outlets are more profitable

– The analyst gathers the necessary data and proves, disproves, or amends and retests the hypothesis

• Data Mining – can perform hypothesis verification – on vast quantities of data

• Data Mining allows the user to discover patterns that the user did not know existed!


Types of Data Mining

• Directed
• Undirected


Directed Data Mining

• A top-down approach – used when there is some idea of what is being looked for (some direction for the search) or an idea of what might be predicted

• The goal is to create a predictive model or set of models from the existing data which can then be used to predict future trends.

• For example - which customers are most likely to be interested in a new type of cake filling


Undirected Data Mining

• A bottom-up approach - the data itself determines the relationships – for example using clustering
– If patterns are found it is for the user to determine whether the patterns are useful or not.
• The goal is to find patterns in the existing data.
– Human interaction is necessary because only people can determine what significance, if any, the patterns have
• This type of data mining is one of the key steps in Knowledge Discovery in Databases (KDD).
• Necessary to know how the model works and how it comes up with the answer in order to decide if patterns are valid
• Example: People who are over 5ft tall with brown hair like Blackcurrant Jam


Approaches to Data Mining

• Descriptive
– Describes the current data in terms of rules or patterns
• Predictive
– Identify a set of rules/model which can be used to predict currently unknown values


Descriptive Data Mining uses

• Market Basket Analysis
• Clustering
• Classification


Descriptive Data Mining uses: Market Basket Analysis

• Identifies relationships between data – for example, patterns in transaction purchases

• One or more rules can be developed. A rule is supported according to how frequently it occurs in the data, and a confidence value can be calculated and expressed as a ratio
• This is also known as market basket analysis
• For example: People who buy Blackcurrant Jam also buy Redcurrant Jelly
– Beer and Nappies?
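
The support and confidence of a rule such as "customers who buy blackcurrant jam also buy redcurrant jelly" can be sketched as follows. The baskets are invented for illustration and are not BJP data; `support` and `confidence` are hypothetical helper names.

```python
# Hypothetical transactions: each basket is the set of products in one order.
baskets = [
    {"blackcurrant jam", "redcurrant jelly"},
    {"blackcurrant jam", "redcurrant jelly", "strawberry jam"},
    {"blackcurrant jam", "lemon curd"},
    {"strawberry jam"},
    {"redcurrant jelly", "lemon curd"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


rule_support = support({"blackcurrant jam", "redcurrant jelly"}, baskets)
rule_confidence = confidence({"blackcurrant jam"}, {"redcurrant jelly"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here the rule holds in 2 of 5 baskets, and in 2 of the 3 baskets that contain blackcurrant jam.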


Example

• BJP analysts discovered that sales of Strawberry Jam increased:
– When the customer was offered a small pot of Blackcurrant Jam free with the purchase
– With the height of the person buying the product

• How commercially useful is this information?

• Just because there is a correlation does not mean it is useful


Descriptive Data Mining uses: Clustering

• Identifies the natural groupings within data – e.g. customers may be classified into groups. This is known as customer segmentation and is useful in Customer Relationship Management (CRM)

• Data items within groups should be as similar as possible to each other and as different as possible to other groups

• Need to determine parameters which will result in realistic clusters
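
As one possible illustration of clustering, the sketch below uses k-means, which is a common choice but is not named in the lecture. The customer figures (monthly jam orders, monthly cake filling orders) and the starting centroids are invented; the number of clusters is an example of a parameter that has to be chosen.

```python
import math

# Hypothetical customers: (jam orders per month, cake filling orders per month).
points = [(10, 0), (12, 1), (11, 0), (0, 9), (1, 10), (0, 11), (8, 9), (9, 10)]


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Three starting centroids, i.e. k = 3 (an assumed parameter choice).
centroids, clusters = kmeans(points, [(10, 0), (0, 9), (8, 9)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

With this data the three groups correspond to jam-only, cake-filling-only and mixed buyers.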


Example

• BJP has identified clusters of customers who buy only jam, customers who buy only cake fillings, customers who buy both

• How would this be commercially useful?


Descriptive Data Mining uses: Classification

• Data of interest is sorted into predefined classes

• BJP classifies customers as
– Multinational
– UK based
– Independent chain
– Single outlet


Predictive Data Mining Use

• Customers in the single outlet category typically order jams and sauces but not cake fillings

• A new client is placed in the single outlet category – it is possible to predict likely ordering patterns
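
A minimal sketch of this kind of prediction, assuming the order history is available as (category, products ordered) pairs. The history, the 50% threshold and the name `likely_products` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical order history: (customer category, product types ordered).
history = [
    ("single outlet", {"jam", "sauce"}),
    ("single outlet", {"jam", "sauce"}),
    ("single outlet", {"jam"}),
    ("multinational", {"jam", "cake filling", "sauce"}),
    ("multinational", {"cake filling"}),
]


def likely_products(category, history, threshold=0.5):
    """Product types ordered by at least `threshold` of the existing
    customers in the given category."""
    members = [products for cat, products in history if cat == category]
    if not members:
        return set()
    counts = defaultdict(int)
    for products in members:
        for product in products:
            counts[product] += 1
    return {p for p, n in counts.items() if n / len(members) >= threshold}


# A new client placed in the single outlet category:
prediction = likely_products("single outlet", history)
print(sorted(prediction))
```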


Stages in Data-Mining

1. Preparation of data
– This stage involves selection and preparation of input data from a variety of sources
• Data integration
• Data cleansing
• Data warehousing (this usually includes the above)
2. Mining stage
– This stage involves producing useful predictive models (OLAP)
3. Interpretation and Evaluation – Knowledge Discovery
– The final stage involves deploying the models and applying them to new data in order to generate predictions or new knowledge.


1. Preparation of Data

• Input data must be in or converted to electronic form. It could come from a variety of different sources such as:
– Operational Databases (sales, finance etc.)
– Commercial Databases (demographics)
– Internet documents
– Spreadsheets or other “office” documents
• The input data must be integrated and cleansed.
• Note – much of the preparation is complete in a data warehouse


Data Integration

• Data from different sources must be integrated to provide homogeneity (a single consistent form)
– Involves de-normalisation of databases
– Dates and times must be of the same format.
– Records must be of the same type
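
Bringing dates into one format might look like the sketch below. The list of source formats is an assumption; a real integration job would use whatever formats the contributing systems actually produce.

```python
from datetime import datetime

# Assumed formats used by the different source systems (day-first for the
# slash and dot styles, as is usual in the UK).
SOURCE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_iso_date(text):
    """Convert a date string in any known source format to YYYY-MM-DD."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognised date format: {text!r}")


for raw in ["2021-02-03", "03/02/2021", "3.2.2021"]:
    print(raw, "->", to_iso_date(raw))
```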


Data Cleansing

• Once integrated, the data must be cleansed to resolve the following issues
– Duplicate data
• Need to delete
– Missing values (unrecorded or really missing?)
• Unrecorded - might not have been required in one or more of the contributing data sets. Could be added if based on other values, e.g. post code.
• Really missing - could actually denote a missing value, e.g. an unpaid bill.
• Need to decide how missing values will be represented.
– Irrelevant values
• Need expert to identify sets and delete
– Inaccurate data
• Can identify anomalies by using graphs and clusters. Values outside the normal expected range can be investigated.
– Old data
• Need to delete
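
Some of these cleansing steps can be sketched in a few lines. The records, the "UNKNOWN" marker for missing values and the expected order-value range are assumptions chosen for illustration.

```python
# Hypothetical integrated records.
records = [
    {"customer": "Acme Bakery", "postcode": "AA1 1AA", "order_value": 120.0},
    {"customer": "Acme Bakery", "postcode": "AA1 1AA", "order_value": 120.0},
    {"customer": "Sweet Tooth", "postcode": None, "order_value": 95.0},
    {"customer": "Cake Corner", "postcode": "BB2 2BB", "order_value": 98000.0},
]


def cleanse(records, low, high, missing="UNKNOWN"):
    """Drop exact duplicates, give missing post codes an explicit marker,
    and set aside out-of-range order values for investigation."""
    seen, clean, suspect = set(), [], []
    for record in records:
        key = (record["customer"], record["postcode"], record["order_value"])
        if key in seen:
            continue  # duplicate data: delete
        seen.add(key)
        fixed = dict(record)
        if fixed["postcode"] is None:
            fixed["postcode"] = missing  # agreed representation of missing
        if low <= fixed["order_value"] <= high:
            clean.append(fixed)
        else:
            suspect.append(fixed)  # outside the expected range: investigate
    return clean, suspect


clean, suspect = cleanse(records, low=0.0, high=5000.0)
print(len(clean), "clean;", len(suspect), "to investigate")
```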


What are demographic overlays?

• Most customer databases include post codes.
• Various data is collected via census and based on post codes, e.g.
– Gender distribution
– Age distribution
• Other data is known about areas, e.g.
– Proximity to the coast
– Major employers
– Proximity to National parks
• This data could be used in conjunction with customer data to predict trends, e.g.
– If a product sells well in one area close to the coast with a higher than average percentage of old ladies, then it might be worth marketing that product in other such areas.
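
A rough sketch of using an overlay to find "other such areas". The post code areas, the overlay attributes and the 5-point tolerance are invented for illustration.

```python
# Hypothetical demographic overlay keyed by post code area.
overlay = {
    "AB1": {"coastal": True, "pct_older_women": 31.0},
    "CD2": {"coastal": False, "pct_older_women": 14.0},
    "EF3": {"coastal": True, "pct_older_women": 29.0},
    "GH4": {"coastal": True, "pct_older_women": 12.0},
}


def similar_areas(overlay, reference, tolerance=5.0):
    """Areas that match the reference area on coastal status and are within
    `tolerance` percentage points on the demographic measure."""
    ref = overlay[reference]
    return sorted(
        area for area, data in overlay.items()
        if area != reference
        and data["coastal"] == ref["coastal"]
        and abs(data["pct_older_women"] - ref["pct_older_women"]) <= tolerance
    )


# Suppose customer data shows the product sells well in AB1.
targets = similar_areas(overlay, "AB1")
print(targets)
```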


2. Mining stage

Customer names in a certain post code area

It is known that in this area 75% of the population is considered Rich and 75% is male

A Typical Data Set


Histograms

[Figure: histograms of the data set. 1 dimensional: number by gender (Female, Male). 2 dimensional: Poor and Rich by age band (21-30, 31-40, 41-50, 51-60, 61-70), shown as counts and as percentages.]


Into the 3rd Dimension

[Figure: three-dimensional view of the data set, with axes including gender (Female, Male) and wealth (Poor, Rich).]

• Even with just two attributes each with two values the table is more difficult to understand.
• What if there were 16 attributes each with multiple values?
• The number of 2d histograms which could be potentially useful would be over 100.
• This structure is known as an OLAP cube.
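
The "over 100" figure can be checked with a short calculation, assuming one 2d histogram per unordered pair of attributes:

```python
import math

attributes = 16
# Number of ways to choose 2 attributes out of 16, ignoring order.
pairs = math.comb(attributes, 2)
print(pairs)  # 120
```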


On-line Analytical Processing OLAP

• OLAP functionality is characterised by dynamic multi-dimensional analysis of consolidated enterprise data:
– Slice: A slice is a subset of a multi-dimensional array corresponding to a single value for one or more members of the dimensions not in the subset.
– Dice: The dice operation is a slice on more than two dimensions of a data cube (or more than two consecutive slices).
– Drill Down/Up: Drilling down or up is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down).
– Roll-up: A roll-up involves computing all of the data relationships for one or more dimensions. To do this, a computational relationship or formula might be defined.
– Pivot: To change the dimensional orientation of a report or page display
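
These operations can be sketched on a tiny cube held in a Python dictionary. The dimensions (region, product, quarter), the sales figures and the function names are assumptions, and summation is assumed as the roll-up formula; real OLAP servers provide these operations natively.

```python
from collections import defaultdict

DIMS = ("region", "product", "quarter")

# Hypothetical cube: (region, product, quarter) -> units sold.
cube = {
    ("England", "jam", "Q1"): 100,
    ("England", "jam", "Q2"): 120,
    ("England", "sauce", "Q1"): 40,
    ("Scotland", "jam", "Q1"): 60,
    ("Scotland", "sauce", "Q2"): 30,
    ("Spain", "jam", "Q2"): 50,
}


def slice_cube(cube, dim, value):
    """Fix one dimension to a single value and drop it from the keys."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice(cube, **allowed):
    """Keep cells whose coordinates lie in the allowed set per dimension."""
    return {k: v for k, v in cube.items()
            if all(k[DIMS.index(d)] in values for d, values in allowed.items())}


def roll_up(cube, dim):
    """Summarise by summing over one dimension (drilling up)."""
    i = DIMS.index(dim)
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[k[:i] + k[i + 1:]] += v
    return dict(totals)


def pivot(table):
    """Swap the two dimensions of a 2-dimensional result."""
    return {(b, a): v for (a, b), v in table.items()}


q1 = slice_cube(cube, "quarter", "Q1")
uk_jam = dice(cube, region={"England", "Scotland"}, product={"jam"})
by_region_product = roll_up(cube, "quarter")
by_product_region = pivot(by_region_product)
print(q1, uk_jam, by_region_product, by_product_region, sep="\n")
```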


OLAP

• Uses various algorithms – examples are:
a. Decision trees
b. K-Nearest neighbour


Decision trees

• The Decision Tree is one of the most popular classification algorithms in current use in Data Mining
– A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no decision.
– Algorithm is recursive partitioning – divide and conquer.
– Internal nodes denote a test.
– A branch represents the outcome
– The leaf nodes represent the class.
• The algorithm is simple but extremely powerful.
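
How a finished tree produces a yes/no decision can be sketched as follows. The tree shape and the sample animals are assumptions for illustration; internal nodes are tuples (test, branch if yes, branch if no) and leaves are the yes/no answer.

```python
# Assumed tree for "is this animal a rabbit?".
# Internal node: (attribute tested, subtree if yes, subtree if no).
# Leaf node: True (in class) or False (not in class).
tree = ("wings", False,
        ("whiskers",
         ("eats_meat", False, True),
         False))


def classify(tree, sample):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(tree, tuple):
        attribute, if_yes, if_no = tree
        tree = if_yes if sample[attribute] else if_no
    return tree


rabbit = {"wings": False, "whiskers": True, "eats_meat": False}
cat = {"wings": False, "whiskers": True, "eats_meat": True}
print(classify(tree, rabbit), classify(tree, cat))
```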


Decision tree example

[Figure: decision tree for a candidate for the class label Rabbit. Tests (internal nodes), each with Y/N branches: Wings? Swims? Legs? Whiskers? Eats Meat? Rules: Rabbit does not have wings; Rabbit has whiskers; Rabbit does not eat meat; Rabbit does not swim; Rabbit has legs. A failed test leads to the leaf node "Not in class!".]


Need to decide

• Which attributes to select in order to identify the class of the sample as quickly as possible?

• When to stop?
– No remaining attributes to test or when the class is determined
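
One common way to settle both decisions is to pick the attribute with the highest information gain at each step and to stop when the class is determined or no attributes remain. The lecture does not name a selection measure, so information gain is an assumption here, and the training rows are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute):
    """Reduction in entropy obtained by splitting rows on attribute."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r["class"] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r["class"] for r in rows]) - remainder


def build(rows, attributes):
    """Recursive partitioning; returns a leaf label or (attribute, branches)."""
    labels = [r["class"] for r in rows]
    # Stop: the class is determined, or there are no attributes left to test.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a))
    rest = [a for a in attributes if a != best]
    return (best, {value: build([r for r in rows if r[best] == value], rest)
                   for value in {r[best] for r in rows}})


# Hypothetical training samples.
animals = [
    {"wings": False, "eats_meat": False, "class": "rabbit"},
    {"wings": False, "eats_meat": False, "class": "rabbit"},
    {"wings": False, "eats_meat": True, "class": "other"},
    {"wings": True, "eats_meat": True, "class": "other"},
    {"wings": True, "eats_meat": False, "class": "other"},
    {"wings": False, "eats_meat": True, "class": "other"},
]
tree = build(animals, ["wings", "eats_meat"])
print(tree)
```

With these rows, "eats_meat" separates the classes better than "wings", so it is tested first.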


K-Nearest neighbour

• k-nearest neighbour algorithm (k-NN) is a popular method for classification
– Feature space is a multidimensional space where each pattern sample is represented as a point whose dimension is determined by the number of features used to describe the patterns.
– Firstly the training samples and their class labels are plotted in the multidimensional feature space. The space is then partitioned into regions by class labels of the training samples. The training phase of the algorithm consists simply of plotting the points in the feature space.
– In the actual classification phase, the same features as before are computed for a test sample. Distances from the new point to all stored points are computed and k closest samples are selected. The test sample is assigned to the class whose label is the most frequent among the k nearest training samples.
• The algorithm is easy to implement, but it is computationally intensive, especially when the size of the training set grows.
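
The classification phase described above can be sketched in a few lines. Euclidean distance and the two-feature training samples are assumptions for illustration. Every stored point is compared with the test sample, which is why the cost grows with the size of the training set.

```python
import math
from collections import Counter

# Hypothetical training samples: (feature vector, class label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]


def knn_classify(training, sample, k=3):
    """Label the sample with the most frequent class among its k nearest
    training samples (Euclidean distance)."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (6.0, 6.5)))
```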


K-Nearest neighbour Example

This is simplistic – usually 16 or more attributes are used.
The small coloured dots are the training samples.
Each colour represents a different class label.
The large black dots are test samples.

When k > 5 boundaries are less distinct in most cases


Need to decide

• A value for k

• The best choice of k depends upon the data
– Generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct
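
The effect of k on noise can be seen in a small invented example: one mislabelled training point sits inside class A, so k = 1 follows the noise while k = 3 outvotes it.

```python
import math
from collections import Counter

# Hypothetical training data; the point at (2.1, 2.1) is deliberate noise.
training = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((2.1, 2.1), "B"),
    ((6.0, 6.0), "B"), ((7.0, 6.0), "B"), ((6.0, 7.0), "B"),
]


def knn_classify(training, sample, k):
    """Majority vote among the k training samples closest to the sample."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


sample = (2.0, 2.0)
for k in (1, 3):
    print(k, knn_classify(training, sample, k))
```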


3. Interpretation and Evaluation

• Uses of Data Mining in business
– Market segmentation
• Identify the common characteristics of customers who buy certain products from a company.
– Customer churn
• Predict which customers are likely to leave your company and go to a competitor.
– Fraud detection
• Identify which transactions are most likely to be fraudulent.
– Direct marketing
• Identify which prospects should be included in a mailing list to obtain the highest response rate.
– Supermarket basket analysis
• Understand what products or services are commonly purchased together.
– Trend analysis
• Reveal the difference between a typical customer this month and last. Allows organisations to map trends and

Further Reading

• The OLAP report
• A view from QUB
• Date chapter 22

