Exploratory Input


    Exploratory Data Mining and Data Preparation (Fall 2003 Data Mining)


    The Data Mining Process

    [Diagram: the data mining process as a cycle of Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment.]


    Exploratory Data Mining

    Preliminary process

    Data summaries

    Attribute means

    Attribute variation

    Attribute relationships

    Visualization


    Select an attribute

    Summary Statistics

    Possible Problems:

    Many missing values (16%)

    No examples of one value

    Visualization

    Appears to be a good predictor of the class



    Exploratory DM Process

    For each attribute:

    Look at data summaries

    Identify potential problems and decide if an action needs to be taken (may require collecting more data)

    Visualize the distribution

    Identify potential problems (e.g., one dominant

    attribute value, even distribution, etc.)

    Evaluate usefulness of attributes (see the sketch below)
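    These per-attribute checks can be scripted. A minimal sketch using the Weka Java API (class names from Weka 3.x; the ARFF file name is a hypothetical placeholder):

    import weka.core.AttributeStats;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ExploreAttributes {
        public static void main(String[] args) throws Exception {
            // Load a dataset (hypothetical file name)
            Instances data = DataSource.read("weather.arff");

            for (int i = 0; i < data.numAttributes(); i++) {
                AttributeStats stats = data.attributeStats(i);
                System.out.print(data.attribute(i).name()
                    + ": missing=" + stats.missingCount
                    + ", distinct=" + stats.distinctCount);
                if (data.attribute(i).isNumeric()) {
                    // Mean and spread for numeric attributes
                    System.out.print(", mean=" + stats.numericStats.mean
                        + ", stdDev=" + stats.numericStats.stdDev);
                }
                System.out.println();
            }
        }
    }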


    Weka Filters

    Weka has many filters that are helpful in preprocessing the data

    Attribute filters: add, remove, or transform attributes

    Instance filters: add, remove, or transform instances

    Process: choose from the drop-down menu, edit parameters (if any), apply (a programmatic sketch of the same sequence follows)
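    A minimal sketch, assuming Weka 3.x and a hypothetical input file, using ReplaceMissingValues (discussed two slides below) as the example filter:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class ApplyFilter {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file

            // 1. Choose a filter (here: an attribute filter)
            ReplaceMissingValues filter = new ReplaceMissingValues();
            // 2. Edit parameters (none needed for this filter)
            // 3. Apply
            filter.setInputFormat(data);
            Instances filtered = Filter.useFilter(data, filter);

            System.out.println(filtered.numInstances() + " instances after filtering");
        }
    }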


    Data Preprocessing

    Data cleaning

    Missing values, noisy or inconsistent data

    Data integration/transformation

    Data reduction

    Dimensionality reduction, data compression, numerosity reduction

    Discretization


    Data Cleaning

    Missing values

    Weka reports % of missing values

    Can use the ReplaceMissingValues filter

    Noisy data

    Due to uncertainty or errors

    Weka reports unique values

    Useful filters include

    RemoveMisclassified

    MergeTwoValues


    Data Transformation

    Why transform data?

    Combine attributes. For example, the ratio of two attributes might be more useful than keeping them separate (sketch below)

    Normalizing data. Having attributes on the same approximate scale helps many data mining algorithms (hence better models)

    Simplifying data. For example, working with discrete data is often more intuitive and helps the algorithms (hence better models)
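    As an illustration of the first point, Weka's AddExpression filter can append a ratio attribute. The attribute indices (a1, a2), the new attribute's name, and the file name below are illustrative assumptions:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddExpression;

    public class RatioAttribute {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("numeric.arff"); // hypothetical file

            // Append a new attribute holding the ratio of attributes 1 and 2
            AddExpression ratio = new AddExpression();
            ratio.setExpression("a1/a2");   // a1, a2 refer to the 1st and 2nd attributes
            ratio.setName("ratio_1_2");     // name of the generated attribute
            ratio.setInputFormat(data);
            Instances extended = Filter.useFilter(data, ratio);

            System.out.println(extended.attribute(extended.numAttributes() - 1));
        }
    }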


    Weka Filters

    The data transformation filters in Weka include:

    Add

    AddExpression

    MakeIndicator

    NumericTransform

    Normalize

    Standardize
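    A sketch of the last two, which rescale every numeric attribute (Normalize maps onto [0,1]; Standardize gives zero mean and unit variance); the file name is hypothetical:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Normalize;
    import weka.filters.unsupervised.attribute.Standardize;

    public class Rescale {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("numeric.arff"); // hypothetical file

            Normalize norm = new Normalize();       // numeric attributes -> [0,1]
            norm.setInputFormat(data);
            Instances normalized = Filter.useFilter(data, norm);

            Standardize std = new Standardize();    // zero mean, unit variance
            std.setInputFormat(data);
            Instances standardized = Filter.useFilter(data, std);

            System.out.println(normalized.instance(0));
            System.out.println(standardized.instance(0));
        }
    }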


    Discretization

    Discretization reduces the number of values for a continuous attribute

    Why?

    Some methods can only use nominal data

    E.g., the ID3 and Apriori algorithms in Weka

    Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)


    Unsupervised Discretization

    Unsupervised - does not account for classes

    Equal-interval binning

    Equal-frequency binning

    Example (temperature values paired with play = yes/no class labels; the duplicated values 72 and 75 are stacked in the original figure):

    64 65 68 69 70 71 72 72 75 75 80 81 83 85

    Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

    The figure shows this sequence binned once by equal intervals and once by equal frequencies (code sketch below).
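    Both schemes are available through Weka's unsupervised Discretize filter (the bin count and file name below are illustrative):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class Binning {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file

            // Equal-interval binning: 4 bins of equal width
            Discretize equalWidth = new Discretize();
            equalWidth.setBins(4);
            equalWidth.setInputFormat(data);
            Instances byWidth = Filter.useFilter(data, equalWidth);

            // Equal-frequency binning: 4 bins with (roughly) equal counts
            Discretize equalFreq = new Discretize();
            equalFreq.setBins(4);
            equalFreq.setUseEqualFrequency(true);
            equalFreq.setInputFormat(data);
            Instances byFreq = Filter.useFilter(data, equalFreq);

            System.out.println(byWidth.attribute(0));
            System.out.println(byFreq.attribute(0));
        }
    }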


    Supervised Discretization

    Take classification into account

    Use entropy to measure information gain

    Goal: discretize into 'pure' intervals

    Usually there is no way to get completely pure intervals:

    64 65 68 69 70 71 72 72 75 75 80 81 83 85

    Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

    [Figure: candidate intervals labeled A-F, with class counts such as 9 yes & 4 no, 1 no, 1 yes, 8 yes & 5 no.]
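    Weka's supervised Discretize filter implements this entropy-based approach (the Fayyad-Irani MDL criterion); a sketch, assuming the class is the last attribute and the file name is hypothetical:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    public class SupervisedBinning {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);     // class must be set

            // Entropy-based discretization using the class labels
            Discretize disc = new Discretize();
            disc.setInputFormat(data);
            Instances discretized = Filter.useFilter(data, disc);

            System.out.println(discretized.attribute(0));
        }
    }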


    Error-Based Discretization

    Count the number of misclassifications

    Majority class determines the prediction

    Count instances that are different

    Must restrict the number of intervals (otherwise each value could get its own pure interval)

    Complexity:

    Brute force: exponential time

    Dynamic programming: linear time

    Downside: cannot generate adjacent intervals with the same label


    Weka Filter


    Attribute Selection

    Before inducing a model we almost always do input engineering

    The most useful part of this is attribute selection (also called feature selection)

    Select relevant attributes

    Remove redundant and/or irrelevant attributes

    Why?


    Reasons for Attribute Selection

    Simpler model

    More transparent

    Easier to interpret

    Faster model induction

    What about overall time?

    Structural knowledge

    Knowing which attributes are important may be inherently important to the application

    What about the accuracy?


    Attribute Selection Methods

    What is evaluated?

    Evaluation method      Attributes   Subsets of attributes
    Independent            Filters      Filters
    Learning algorithm     -            Wrappers


    Filters

    Results in either:

    A ranked list of attributes

    Typical when each attribute is evaluated individually

    Must select how many to keep

    A selected subset of attributes

    Forward selection

    Best first

    Random search such as a genetic algorithm


    Filter Evaluation Examples

    Information gain

    Gain ratio

    Relief

    Correlation:

    High correlation with the class attribute

    Low correlation with other attributes
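    A sketch of two of these in Weka's attribute-selection API: information gain with a ranker, and CfsSubsetEval, which captures the correlation idea (high with the class, low between attributes). The counts and file name are illustrative:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FilterSelection {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // Rank all attributes by information gain, keep the top 5
            AttributeSelection ig = new AttributeSelection();
            ig.setEvaluator(new InfoGainAttributeEval());
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(5);
            ig.setSearch(ranker);
            ig.SelectAttributes(data);
            System.out.println(ig.toResultsString());

            // Correlation-based subset selection (CFS) with best-first search
            AttributeSelection cfs = new AttributeSelection();
            cfs.setEvaluator(new CfsSubsetEval());
            cfs.setSearch(new BestFirst());
            cfs.SelectAttributes(data);
            System.out.println(cfs.toResultsString());
        }
    }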


    Wrappers

    Wrap around the learning algorithm

    Must therefore always evaluate subsets

    Return the best subset of attributes

    Apply for each learning algorithm

    Use the same search methods as before

    [Flowchart: select a subset of attributes -> induce the learning algorithm on this subset -> evaluate the resulting model (e.g., accuracy) -> stop? If no, loop back; if yes, return the subset.]
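    A sketch of this loop in Weka: WrapperSubsetEval cross-validates a chosen learner (here J48) on each candidate subset, with best-first search. The fold count and file name are illustrative:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.WrapperSubsetEval;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WrapperSelection {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            // Evaluate subsets by cross-validated accuracy of the target learner
            WrapperSubsetEval wrapper = new WrapperSubsetEval();
            wrapper.setClassifier(new J48());
            wrapper.setFolds(5);

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(wrapper);
            sel.setSearch(new BestFirst()); // same search methods as for filters
            sel.SelectAttributes(data);

            System.out.println(sel.toResultsString());
        }
    }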


    How does it help?

    Naïve Bayes

    Instance-based learning

    Decision tree induction


    Scalability

    Data mining mostly uses well-developed techniques (AI, statistics, optimization)

    Key difference: very large databases

    How to deal with scalability problems?

    Scalability: the capability of handling increased load in a way that does not adversely affect performance


    Massive Datasets

    Very large data sets (millions+ of instances, hundreds+ of attributes)

    Scalability in space and time

    Data set cannot be kept in memory

    E.g., processing one instance at a time

    Learning time very long

    How does the time depend on the input? Number of attributes, number of instances


    Two Approaches

    Increased computational power

    Only works if algorithms can be sped up

    Must have the computing availability

    Adapt algorithms

    Automatically scale down the problem so that it is always of approximately the same difficulty


    Computational Complexity

    We want to design algorithms with good computational complexity

    [Plot: running time vs. number of instances (or number of attributes) for logarithmic, linear, polynomial, and exponential growth.]


    Example: Big-Oh Notation

    Define n = number of instances, m = number of attributes

    Going once through all the instances has complexity O(n)

    Examples:

    Polynomial complexity: O(mn^2)

    Linear complexity: O(m + n)

    Exponential complexity: O(2^n)

    For scale: with n = 1,000,000 an O(n) pass is about a million steps, while O(2^n) is infeasible already for n = 50


    Classification

    If no polynomial-time algorithm is known to solve a problem, it is called NP-complete

    Finding the optimal decision tree is an example of an NP-complete problem

    However, ID3 and C4.5 are polynomial-time algorithms

    Heuristic algorithms that construct solutions to a difficult problem

    Efficient from a computational-complexity standpoint, but they still have a scalability problem


    Decision Tree Algorithms

    Traditional decision tree algorithms assume the training set is kept in memory

    Swapping in and out of main and cache memory is expensive

    Solution: partition the data into subsets

    Build a classifier on each subset

    Combine the classifiers (sketch below)

    Not as accurate as a single classifier
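    A minimal sketch of the partition-and-combine idea (a generic illustration, not a specific published algorithm): train a J48 tree on each of k disjoint folds, then average the trees' class distributions at prediction time. File name and k are illustrative:

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PartitionedTrees {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("large.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));

            int k = 3; // number of disjoint partitions
            Classifier[] trees = new Classifier[k];
            for (int i = 0; i < k; i++) {
                Instances part = data.testCV(k, i); // i-th fold = one partition
                trees[i] = new J48();
                trees[i].buildClassifier(part);
            }

            // Combine: average the class distributions of the k trees
            Instance x = data.instance(0);
            double[] combined = new double[data.numClasses()];
            for (Classifier t : trees) {
                double[] d = t.distributionForInstance(x);
                for (int c = 0; c < combined.length; c++) combined[c] += d[c] / k;
            }
            System.out.println(java.util.Arrays.toString(combined));
        }
    }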


    Other Classification Examples

    Instance-Based Learning

    Goes through instances one at a time

    Compares each with the new instance

    Polynomial complexity: O(mn)

    Response time may be slow, however

    Naïve Bayes

    Polynomial complexity

    Stores a very large model


    Data Reduction

    Another way is to reduce the size of the data before applying a learning algorithm (preprocessing)

    Some strategies:

    Dimensionality reduction

    Data compression

    Numerosity reduction


    Dimensionality Reduction

    Remove irrelevant, weakly relevant, and redundant attributes

    Attribute selection: many methods available

    E.g., forward selection, backward elimination, genetic algorithm search

    Often a much smaller problem

    Often little degradation in predictive performance, or even better performance


    Data Compression

    Also aims at dimensionality reduction

    Transform the data into a smaller space

    Principal Component Analysis:

    Normalize data

    Compute orthonormal vectors, or principal components, that provide a basis for the normalized data

    Sort according to decreasing significance

    Eliminate the weaker components (sketch below)
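    A sketch of these steps with PrincipalComponents from Weka's attribute-selection package; the variance threshold and file name are illustrative:

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.PrincipalComponents;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Pca {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("numeric.arff"); // hypothetical file
            data.setClassIndex(data.numAttributes() - 1);

            PrincipalComponents pca = new PrincipalComponents();
            pca.setVarianceCovered(0.95); // keep components covering 95% of variance

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(pca);
            sel.setSearch(new Ranker()); // components sorted by eigenvalue
            sel.SelectAttributes(data);

            Instances transformed = sel.reduceDimensionality(data);
            System.out.println(transformed.numAttributes() + " components kept");
        }
    }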


    PCA: Example


    Numerosity Reduction

    Replace the data with an alternative, smaller representation

    Histogram example: the 52 values

    1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30

    reduce to three bin counts:

    1-10: 13    11-20: 25    21-30: 14
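    Computing such a histogram is a single pass over the data; a minimal self-contained sketch with the values above hard-coded:

    public class Histogram {
        public static void main(String[] args) {
            int[] values = {1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
                            15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
                            20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30};

            int[] counts = new int[3]; // bins 1-10, 11-20, 21-30
            for (int v : values) counts[(v - 1) / 10]++;

            System.out.println("1-10: " + counts[0]
                + "  11-20: " + counts[1] + "  21-30: " + counts[2]);
        }
    }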


    Other Numerosity Reduction

    Clustering

    Data objects (instances) in the same cluster can be treated as the same instance

    Must use a scalable clustering algorithm

    Sampling

    Randomly select a subset of the instances to be used


    Sampling Techniques

    Different samples:

    Sample without replacement

    Sample with replacement

    Cluster sample

    Stratified sample

    The complexity of sampling is actually sublinear; that is, the complexity is O(s), where s is the number of sampled instances and s may be much smaller than n


    Weka Filters

    PrincipalComponents is under the Attribute Selection tab

    Already talked about filters to discretize the data

    The Resample filter randomly samples a given percentage of the data

    If you specify the same seed, you'll get the same sample again (sketch below)
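    A sketch of Resample; the percentage, seed, and file name are illustrative (setter names per the Weka 3.x unsupervised instance filter):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.Resample;

    public class Subsample {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("large.arff"); // hypothetical file

            Resample sample = new Resample();
            sample.setSampleSizePercent(10.0); // keep 10% of the instances
            sample.setRandomSeed(42);          // same seed -> same sample again
            sample.setNoReplacement(true);     // sample without replacement
            sample.setInputFormat(data);
            Instances small = Filter.useFilter(data, sample);

            System.out.println(small.numInstances() + " of "
                + data.numInstances() + " instances kept");
        }
    }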