Exploratory Input


    Exploratory Data Mining and Data Preparation (Fall 2003 Data Mining)


    The Data Mining Process

    [Diagram: the data mining process as a cycle of Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment.]


    Exploratory Data Mining

    Preliminary process

    Data summaries

    Attribute means

    Attribute variation

    Attribute relationships

    Visualization


    Select an attribute

    Summary Statistics

    Possible Problems:

    Many missing values (16%)

    No examples of one value

    Visualization

    Appears to be a good predictor of the class



    Exploratory DM Process

    For each attribute:

    Look at data summaries

    Identify potential problems and decide if an action needs to be taken (may require collecting more data)

    Visualize the distribution

    Identify potential problems (e.g., one dominant

    attribute value, even distribution, etc.)

    Evaluate usefulness of attributes (see the sketch below)
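    These per-attribute checks can be scripted. A minimal sketch using the Weka Java API (class names from Weka 3.x; the ARFF file name is a hypothetical placeholder):

    import weka.core.AttributeStats;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ExploreAttributes {
        public static void main(String[] args) throws Exception {
            // Load a dataset (hypothetical file name)
            Instances data = DataSource.read("weather.arff");

            for (int i = 0; i < data.numAttributes(); i++) {
                AttributeStats stats = data.attributeStats(i);
                System.out.print(data.attribute(i).name()
                    + ": missing=" + stats.missingCount
                    + ", distinct=" + stats.distinctCount);
                if (data.attribute(i).isNumeric()) {
                    // Mean and spread for numeric attributes
                    System.out.print(", mean=" + stats.numericStats.mean
                        + ", stdDev=" + stats.numericStats.stdDev);
                }
                System.out.println();
            }
        }
    }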


    Weka Filters

    Weka has many filters that are helpful in preprocessing the data

    Attribute filters: add, remove, or transform attributes

    Instance filters: add, remove, or transform instances

    Process: choose from the drop-down menu, edit parameters (if any), apply (a programmatic sketch of the same sequence follows)
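    A minimal sketch, assuming Weka 3.x and a hypothetical input file, using ReplaceMissingValues (discussed two slides below) as the example filter:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class ApplyFilter {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file

            // 1. Choose a filter (here: an attribute filter)
            ReplaceMissingValues filter = new ReplaceMissingValues();
            // 2. Edit parameters (none needed for this filter)
            // 3. Apply
            filter.setInputFormat(data);
            Instances filtered = Filter.useFilter(data, filter);

            System.out.println(filtered.numInstances() + " instances after filtering");
        }
    }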


    Data Preprocessing

    Data cleaning

    Missing values, noisy or inconsistent data

    Data integration/transformation

    Data reduction

    Dimensionality reduction, data compression, numerosity reduction

    Discretization


    Data Cleaning

    Missing values

    Weka reports % of missing values

    Can use the ReplaceMissingValues filter

    Noisy data

    Due to uncertainty or errors

    Weka reports unique values

    Useful filters include

    RemoveMisclassified

    MergeTwoValues


    Data Transformation

    Why transform data?

    Combine attributes. For example, the ratio of two attributes might be more useful than keeping them separate (sketch below)

    Normalizing data. Having attributes on the same approximate scale helps many data mining algorithms (hence better models)

    Simplifying data. For example, working with discrete data is often more intuitive and helps the algorithms (hence better models)
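    As an illustration of the first point, Weka's AddExpression filter can append a ratio attribute. The attribute indices (a1, a2), the new attribute's name, and the file name below are illustrative assumptions:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddExpression;

    public class RatioAttribute {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("numeric.arff"); // hypothetical file

            // Append a new attribute holding the ratio of attributes 1 and 2
            AddExpression ratio = new AddExpression();
            ratio.setExpression("a1/a2");   // a1, a2 refer to the 1st and 2nd attributes
            ratio.setName("ratio_1_2");     // name of the generated attribute
            ratio.setInputFormat(data);
            Instances extended = Filter.useFilter(data, ratio);

            System.out.println(extended.attribute(extended.numAttributes() - 1));
        }
    }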


    Weka Filters

    The data transformation filters in Weka include:

    Add

    AddExpression

    MakeIndicator

    NumericTransform

    Normalize

    Standardize
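    A sketch of the last two, which rescale every numeric attribute (Normalize maps onto [0,1]; Standardize gives zero mean and unit variance); the file name is hypothetical:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;
    import weka.filters.unsupervised.attribute.Standardize;

    public class Rescale {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("numeric.arff"); // hypothetical file

            Normalize norm = new Normalize();       // numeric attributes -> [0,1]
            norm.setInputFormat(data);
            Instances normalized = Filter.useFilter(data, norm);

            Standardize std = new Standardize();    // zero mean, unit variance
            std.setInputFormat(data);
            Instances standardized = Filter.useFilter(data, std);

            System.out.println(normalized.instance(0));
            System.out.println(standardized.instance(0));
        }
    }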


    Discretization

    Discretization reduces the number of values for a continuous attribute

    Why?

    Some methods can only use nominal data

    E.g., the ID3 and Apriori algorithms in Weka

    Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)


    Unsupervised Discretization

    Unsupervised - does not account for classes

    Equal-interval binning

    Equal-frequency binning

    Example (temperature values paired with play = yes/no class labels; the duplicated values 72 and 75 are stacked in the original figure):

    64 65 68 69 70 71 72 72 75 75 80 81 83 85

    Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

    The figure shows this sequence binned once by equal intervals and once by equal frequencies (code sketch below).
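    Both schemes are available through Weka's unsupervised Discretize filter (the bin count and file name below are illustrative):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class Binning {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file

            // Equal-interval binning: 4 bins of equal width
            Discretize equalWidth = new Discretize();
            equalWidth.setBins(4);
            equalWidth.setInputFormat(data);
            Instances byWidth = Filter.useFilter(data, equalWidth);

            // Equal-frequency binning: 4 bins with (roughly) equal counts
            Discretize equalFreq = new Discretize();
            equalFreq.setBins(4);
            equalFreq.setUseEqualFrequency(true);
            equalFreq.setInputFormat(data);
            Instances byFreq = Filter.useFilter(data, equalFreq);

            System.out.println(byWidth.attribute(0));
            System.out.println(byFreq.attribute(0));
        }
    }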


    Supervised Discretization

    Take classification into account

    Use entropy to measure information gain

    Goal: discretize into 'pure' intervals

    Usually there is no way to get completely pure intervals:

    64 65 68 69 70 71 72 72 75 75 80 81 83 85

    Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

    [Figure: candidate intervals labeled A-F, with class counts such as 9 yes & 4 no, 1 no, 1 yes, 8 yes & 5 no.]
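    Weka's supervised Discretize filter implements this entropy-based approach (the Fayyad-Irani MDL criterion); a sketch, assuming the class is the last attribute and the file name is hypothetical:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    public class SupervisedBinning {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);     // class must be set

            // Entropy-based discretization using the class labels
            Discretize disc = new Discretize();
            disc.setInputFormat(data);
            Instances discretized = Filter.useFilter(data, disc);

            System.out.println(discretized.attribute(0));
        }
    }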


    Error-Based Discretization

    Count the number of misclassifications

    Majority class determines the prediction

    Count instances that are different

    Must restrict the number of intervals (otherwise each value could get its own pure interval)

    Complexity:

    Brute force: exponential time

    Dynamic programming: linear time

    Downside: cannot generate adjacent intervals with the same label


    Weka Filter


    Attribute Selection

    Before inducing a model we almost always do input engineering

    The most useful part of this is attribute selection (also called feature selection)

    Select relevant attributes

    Remove redundant and/or irrelevant attributes

    Why?


    Reasons for Attribute Selection

    Simpler model

    More transparent

    Easier to interpret

    Faster model induction

    What about overall time?

    Structural knowledge

    Knowing which attributes are important may be inherently important to the application

    What about the accuracy?


    Attribute Selection Methods

    What is evaluated?

    Evaluation method      Attributes   Subsets of attributes
    Independent            Filters      Filters
    Learning algorithm     -            Wrappers


    Filters

    Results in either:

    A ranked list of attributes

    Typical when each attribute is evaluated individually

    Must select how many to keep

    A selected subset of attributes

    Forward selection

    Best first

    Random search such as a genetic algorithm


    Filter Evaluation Examples

    Information gain

    Gain ratio

    Relief

    Correlation:

    High correlation with the class attribute

    Low correlation with other attributes
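    A sketch of two of these in Weka's attribute-selection API: information gain with a ranker, and CfsSubsetEval, which captures the correlation idea (high with the class, low between attributes). The counts and file name are illustrative:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FilterSelection {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // Rank all attributes by information gain, keep the top 5
            AttributeSelection ig = new AttributeSelection();
            ig.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(5);
            ig.setSearch(ranker);
            ig.SelectAttributes(data);
            System.out.println(ig.toResultsString());

            // Correlation-based subset selection (CFS) with best-first search
            AttributeSelection cfs = new AttributeSelection();
            cfs.setEvaluator(new CfsSubsetEval());
            cfs.setSearch(new BestFirst());
            cfs.SelectAttributes(data);
            System.out.println(cfs.toResultsString());
        }
    }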


    Wrappers

    Wrap around the learning algorithm

    Must therefore always evaluate subsets

    Return the best subset of attributes

    Apply for each learning algorithm

    Use the same search methods as before

    [Flowchart: select a subset of attributes -> induce the learning algorithm on this subset -> evaluate the resulting model (e.g., accuracy) -> stop? If no, loop back; if yes, return the subset.]
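    A sketch of this loop in Weka: WrapperSubsetEval cross-validates a chosen learner (here J48) on each candidate subset, with best-first search. The fold count and file name are illustrative:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.WrapperSubsetEval;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WrapperSelection {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // Evaluate subsets by cross-validated accuracy of the target learner
            WrapperSubsetEval wrapper = new WrapperSubsetEval();
            wrapper.setClassifier(new J48());
            wrapper.setFolds(5);

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(wrapper);
            sel.setSearch(new BestFirst()); // same search methods as for filters
            sel.SelectAttributes(data);

            System.out.println(sel.toResultsString());
        }
    }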


    How does it help?

    Naïve Bayes

    Instance-based learning

    Decision tree induction


    Scalability

    Data mining mostly uses well-developed techniques (AI, statistics, optimization)

    Key difference: very large databases

    How to deal with scalability problems?

    Scalability: the capability of handling increased load in a way that does not adversely affect performance


    Massive Datasets

    Very large data sets (millions+ of instances, hundreds+ of attributes)

    Scalability in space and time

    Data set cannot be kept in memory

    E.g., processing one instance at a time

    Learning time very long

    How does the time depend on the input? Number of attributes, number of instances


    Two Approaches

    Increased computational power

    Only works if algorithms can be sped up

    Must have the computing availability

    Adapt algorithms

    Automatically scale down the problem so that it is always of approximately the same difficulty


    Computational Complexity

    We want to design algorithms with good computational complexity

    [Plot: running time vs. number of instances (or number of attributes) for logarithmic, linear, polynomial, and exponential growth.]


    Example: Big-Oh Notation

    Define n = number of instances, m = number of attributes

    Going once through all the instances has complexity O(n)

    Examples:

    Polynomial complexity: O(mn^2)

    Linear complexity: O(m + n)

    Exponential complexity: O(2^n)

    For scale: with n = 1,000,000 an O(n) pass is about a million steps, while O(2^n) is infeasible already for n = 50


    Classification

    If no polynomial-time algorithm is known to solve a problem, it is called NP-complete

    Finding the optimal decision tree is an example of an NP-complete problem

    However, ID3 and C4.5 are polynomial-time algorithms

    Heuristic algorithms that construct solutions to a difficult problem

    Efficient from a computational-complexity standpoint, but they still have a scalability problem


    Decision Tree Algorithms

    Traditional decision tree algorithms assume the training set is kept in memory

    Swapping in and out of main and cache memory is expensive

    Solution: partition the data into subsets

    Build a classifier on each subset

    Combine the classifiers (sketch below)

    Not as accurate as a single classifier
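    A minimal sketch of the partition-and-combine idea (a generic illustration, not a specific published algorithm): train a J48 tree on each of k disjoint folds, then average the trees' class distributions at prediction time. File name and k are illustrative:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PartitionedTrees {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("large.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));

            int k = 3; // number of disjoint partitions
            Classifier[] trees = new Classifier[k];
            for (int i = 0; i < k; i++) {
                Instances part = data.testCV(k, i); // i-th fold = one partition
                trees[i] = new J48();
                trees[i].buildClassifier(part);
            }

            // Combine: average the class distributions of the k trees
            Instance x = data.instance(0);
            double[] combined = new double[data.numClasses()];
            for (Classifier t : trees) {
                double[] d = t.distributionForInstance(x);
                for (int c = 0; c < combined.length; c++) combined[c] += d[c] / k;
            }
            System.out.println(java.util.Arrays.toString(combined));
        }
    }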


    Other Classification Examples

    Instance-Based Learning

    Goes through instances one at a time

    Compares each with the new instance

    Polynomial complexity: O(mn)

    Response time may be slow, however

    Naïve Bayes

    Polynomial complexity

    Stores a very large model


    Data Reduction

    Another way is to reduce the size of the data before applying a learning algorithm (preprocessing)

    Some strategies:

    Dimensionality reduction

    Data compression

    Numerosity reduction


    Dimensionality Reduction

    Remove irrelevant, weakly relevant, and redundant attributes

    Attribute selection: many methods available

    E.g., forward selection, backward elimination, genetic algorithm search

    Often a much smaller problem

    Often little degradation in predictive performance, or even better performance


    Data Compression

    Also aims at dimensionality reduction

    Transform the data into a smaller space

    Principal Component Analysis:

    Normalize data

    Compute orthonormal vectors, or principal components, that provide a basis for the normalized data

    Sort according to decreasing significance

    Eliminate the weaker components (sketch below)
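    A sketch of these steps with PrincipalComponents from Weka's attribute-selection package; the variance threshold and file name are illustrative:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.PrincipalComponents;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Pca {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("numeric.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            PrincipalComponents pca = new PrincipalComponents();
            pca.setVarianceCovered(0.95); // keep components covering 95% of variance

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(pca);
            sel.setSearch(new Ranker()); // components sorted by eigenvalue
            sel.SelectAttributes(data);

            Instances transformed = sel.reduceDimensionality(data);
            System.out.println(transformed.numAttributes() + " components kept");
        }
    }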


    PCA: Example


    Numerosity Reduction

    Replace the data with an alternative, smaller representation

    Histogram example: the 52 values

    1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30

    reduce to three bin counts:

    1-10: 13    11-20: 25    21-30: 14
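    Computing such a histogram is a single pass over the data; a minimal self-contained sketch with the values above hard-coded:

    public class Histogram {
        public static void main(String[] args) {
            int[] values = {1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
                            15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
                            20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30};

            int[] counts = new int[3]; // bins 1-10, 11-20, 21-30
            for (int v : values) counts[(v - 1) / 10]++;

            System.out.println("1-10: " + counts[0]
                + "  11-20: " + counts[1] + "  21-30: " + counts[2]);
        }
    }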


    Other Numerosity Reduction

    Clustering

    Data objects (instances) in the same cluster can be treated as the same instance

    Must use a scalable clustering algorithm

    Sampling

    Randomly select a subset of the instances to be used


    Sampling Techniques

    Different samples:

    Sample without replacement

    Sample with replacement

    Cluster sample

    Stratified sample

    The complexity of sampling is actually sublinear; that is, the complexity is O(s), where s is the number of sampled instances and s may be much smaller than n


    Weka Filters

    PrincipalComponents is under the Attribute Selection tab

    Already talked about filters to discretize the data

    The Resample filter randomly samples a given percentage of the data

    If you specify the same seed, you'll get the same sample again (sketch below)
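    A sketch of Resample; the percentage, seed, and file name are illustrative (setter names per the Weka 3.x unsupervised instance filter):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.Resample;

    public class Subsample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("large.arff"); // hypothetical file

            Resample sample = new Resample();
            sample.setSampleSizePercent(10.0); // keep 10% of the instances
            sample.setRandomSeed(42);          // same seed -> same sample again
            sample.setNoReplacement(true);     // sample without replacement
            sample.setInputFormat(data);
            Instances small = Filter.useFilter(data, sample);

            System.out.println(small.numInstances() + " of "
                + data.numInstances() + " instances kept");
        }
    }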