lect6.pdf


Artificial Intelligence


    Russell & Norvig: Sections 18.1 to 18.4

Motivation

Too many applications to list here!

Recommender systems: e.g. Amazon --> "you might like this book…"; e.g. LinkedIn --> "people you might know…"

Pattern recognition: e.g. learning to recognize postal codes; handwriting recognition --> who wrote this cheque? …

Why Machine Learning?

To construct programs that automatically improve with experience. Learning is a crucial characteristic of an intelligent agent.

A learning system performs tasks and improves its performance the more tasks it accomplishes; it generalizes from given experiences and is able to make judgments on unseen cases.


Machine Learning vs Data Mining

Supervised machine learning: try to predict the outcome of new data from a training set of already classified examples.
"I know what I am looking for, and I am using this training data to help me generalise."
E.g. diagnostic system: which treatments are best for new diseases?
E.g. speech recognition: given the acoustic pattern, which word was uttered?

Unsupervised machine learning / data mining / knowledge discovery: try to discover valuable knowledge/patterns from large databases of grocery bills, financial transactions, medical records, etc.
"I'm not sure what I am looking for, but I am sure there's something interesting."
E.g. anomaly detection: odd use of a credit card?
E.g. learning association rules: are strawberries & whipped cream bought together?


Types of Learning

In supervised learning: we are given a training set of (X, f(X)) pairs.
  big nose, big teeth, big eyes, no moustache --> f(X) = not person
  small nose, small teeth, small eyes, no moustache --> f(X) = person
  small nose, big teeth, small eyes, moustache --> f(X) = ?

In reinforcement learning: we are not given the (X, f(X)) pairs, but somehow we are told whether our learner is right or wrong. Goal: maximize the number of right answers.

In unsupervised learning: we are only given the Xs, not the corresponding f(X). No teacher involved. Goal: find regularities among the Xs (clustering). Data mining.
  big nose, big teeth, big eyes, no moustache --> not given
  small nose, small teeth, small eyes, no moustache --> not given
  small nose, big teeth, small eyes, moustache --> f(X) = ?


    Remember this slide…



Inductive Learning

Inductive learning = learning from examples. Most work in ML.

Examples are given (positive and/or negative) to train a system in a classification (or regression) task.
Extrapolate from the training set to make accurate predictions about future examples.

Can be seen as learning a function: given a new instance X you have never seen, you must find an estimate of the function f(X), where f(X) is the desired output.
Ex: X = features of a face (e.g. small nose, big teeth, …); f(X) = function to tell whether X represents a human face or not.
  small nose, big teeth, small eyes, moustache --> f(X) = ?


Example

Given 5 pairs (X, f(X)) (the training set), find a function that fits the training set well, so that given a new X, you can predict its f(X) value.

Note: choosing one function over another beyond just looking at the training set is called inductive bias.
  ex: prefer "smoother" functions
  ex: if an attribute does not seem discriminating, drop it

source: Russell & Norvig (2003)


Inductive Learning Framework

Input data are represented by a vector of features, X.
Each vector X is a list of (attribute, value) pairs. Ex: X = [nose:big, teeth:big, eyes:big, moustache:no]
The number of features is fixed (positive, finite).
Each attribute has a fixed, finite number of possible values.
Each example can be interpreted as a point in an n-dimensional feature space, where n is the number of features.
If the output is discrete --> classification.
If the output is continuous --> regression.
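As a concrete illustration (not from the slides), here is one possible way to encode such feature vectors in Python; the dictionary keys follow the face example above, and this representation is only an assumption reused by the later sketches.

```python
# One possible encoding of an example X as (attribute, value) pairs:
# a plain Python dict keyed by attribute name.
X = {"nose": "big", "teeth": "big", "eyes": "big", "moustache": "no"}

# A training set is then a list of such vectors together with their outputs f(X).
training_set = [
    ({"nose": "big",   "teeth": "big",   "eyes": "big",   "moustache": "no"}, "not person"),
    ({"nose": "small", "teeth": "small", "eyes": "small", "moustache": "no"}, "person"),
]
```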


    Example


    Real ML applications typically require hundreds or thousands of examples

    source: Alison Cawsey: The Essence of AI (1997).


Techniques in ML

Probabilistic methods: ex: Naïve Bayes classifier

Decision trees: use only discriminating features as questions in a big if-then-else tree

Genetic algorithms: good functions evolve out of a population by combining possible solutions to produce offspring solutions and killing off the weaker of those solutions

Neural networks: also called parallel distributed processing or connectionist systems; intelligence arises from having a large number of simple computational units


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms


    Guess Who?


    Play online


Decision Trees

Simplest, but most successful form of learning algorithm.
A very well-known algorithm is ID3 (Quinlan, 1987) and its successor C4.5.

Look for features that are very good indicators of the result, and put these features at the top of the tree.
Split the examples so that those with different values for the chosen feature are in different sets.
Repeat the same process with another feature.


ID3 / C4.5 Algorithm

Top-down construction of the decision tree.
Recursive selection of the "best feature" to use at the current node in the tree.
Once the feature is selected for the current node, generate child nodes, one for each possible value of the selected attribute.
Partition the examples using the possible values of this attribute, and assign these subsets of the examples to the appropriate child node.
Repeat for each child node until all examples associated with a node are either all positive or all negative.
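A minimal Python sketch of this recursion, assuming the dict-based example representation shown earlier; `choose_attribute` stands in for the "best feature" criterion (information gain, defined later in these slides), and the function names are illustrative rather than taken from ID3/C4.5 itself.

```python
from collections import Counter

def id3(examples, labels, attributes, choose_attribute):
    """Build a decision tree (as nested dicts) top-down, as described above.

    examples:  list of dicts, e.g. {"nose": "big", "teeth": "big", ...}
    labels:    parallel list of outputs, e.g. ["yes", "no", ...]
    choose_attribute(examples, labels, attributes) -> the "best feature"
    """
    # All examples at this node are all positive or all negative: stop (leaf).
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    best = choose_attribute(examples, labels, attributes)
    tree = {best: {}}
    # One child node per possible value of the selected attribute.
    for value in {ex[best] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best],
                                choose_attribute)
    return tree
```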


Example

Info on last year's students, to determine if a student will get an 'A' this year:

                 'A' last year?  Black hair?  Works hard?  Drinks?  |  'A' this year? (output f(X))
  X1: Richard    Yes             Yes          No           Yes      |  No
  X2: Alan       Yes             Yes          Yes          No       |  Yes
  X3: Alison     No              No           Yes          No       |  No
  X4: Jeff       No              Yes          No           Yes      |  No
  X5: Gail       Yes             No           Yes          Yes      |  Yes
  X6: Simon      No              Yes          Yes          Yes      |  No


Example

The decision tree learned from the training data above:

  'A' last year?
    yes --> Works hard?
              yes --> Output = Yes
              no  --> Output = No
    no  --> Output = No
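An earlier slide described decision trees as big if-then-else structures; purely as an illustration, the tree above can be written as the following Python function. The attribute names and boolean encoding are assumptions, not part of the slides.

```python
def will_get_A(student):
    """The tree above, written out as the if-then-else it represents.

    `student` is assumed to be a dict of booleans, e.g.
    {"A_last_year": True, "black_hair": False, "works_hard": True, "drinks": True}.
    """
    if student["A_last_year"]:           # root test: 'A' last year?
        if student["works_hard"]:        # second test, only on the 'yes' branch
            return True                  # Output = Yes
        return False                     # Output = No
    return False                         # Output = No

# e.g. Gail ('A' last year, works hard) --> True
print(will_get_A({"A_last_year": True, "black_hair": False,
                  "works_hard": True, "drinks": True}))
```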


Example 2: The Restaurant

Goal: learn whether one should wait for a table.

Attributes:
  Alternate: another suitable restaurant nearby
  Bar: comfortable bar for waiting
  Fri/Sat: true on Fridays and Saturdays
  Hungry: whether one is hungry
  Patrons: how many people are present (none, some, full)
  Price: price range ($, $$, $$$)
  Raining: raining outside
  Reservation: reservation made
  Type: kind of restaurant (French, Italian, Thai, Burger)
  WaitEstimate: estimated wait by host (0-10 mins, 10-30, 30-60, >60)


Example 2: The Restaurant

Training data: [table of 12 labelled restaurant examples]

source: Norvig (2003)


    A First Decision Tree


    But is it the best decision tree we can build?

    source: Norvig (2003)


Ockham's Razor Principle

"It is vain to do more than can be done with less… Entities should not be multiplied beyond necessity." [Ockham, 1324]

In other words… always favor the simplest answer that correctly fits the training data, i.e. the smallest tree on average.

This type of assumption is called inductive bias.
Inductive bias = making a choice beyond what the training instances contain.


Finding the Best Tree

Finding the best tree can be seen as searching the space of all possible decision trees, from the empty tree to the complete tree.
Inductive bias: prefer shorter trees on average. How?
Search the space of all decision trees: always pick the next attribute to split the data on based on its "discriminating power" (information gain).
In effect, a hill-climbing search where the heuristic is information gain.

source: Tom Mitchell, Machine Learning (1997)


Which Tree is Best?

[Figure: two trees over the same features F1 … F7 and the same class leaves: a shallow, balanced tree (F1 at the root, F2 and F3 below it, F4-F7 at the next level, eight class leaves) versus a deep, chain-like tree (F1, F2, …, F7 tested in sequence, each with a single class leaf hanging off it).]


Choosing the Next Attribute

The key problem is choosing on which feature to split a given set of examples.
ID3 uses maximum information gain: choose the attribute that has the largest information gain, i.e. the attribute that will result in the smallest expected size of the subtrees rooted at its children.
This criterion comes from information theory.


Intuitively…

Patrons:
  If value is Some … all outputs = Yes
  If value is None … all outputs = No
  If value is Full … we need more tests

Type:
  If value is French … we need more tests
  If value is Italian … we need more tests
  If value is Thai … we need more tests
  If value is Burger … we need more tests

So Patrons is more discriminating…

source: Norvig (2003)


Next Feature…

For only the data where Patrons = Full:

Hungry:
  If value is Yes … we need more tests
  If value is No … all outputs = No

Type:
  If value is French … all outputs = No
  If value is Italian … all outputs = No
  If value is Thai … we need more tests
  If value is Burger … we need more tests

So Hungry is more discriminating (only 1 new branch)…


    A Better Decision Tree

    4 tests instead of 9

    11 branches instead of 21

source: Norvig (2003)


Choosing the Next Attribute

The key problem is choosing on which feature to split a given set of examples. Most used strategy: information theory.

Entropy (or information content):

$$H(X) = -\sum_{x \in X} p(x) \log_2 p(x)$$

$$H(\text{fair coin toss}) = H\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}$$

H(a, b) = entropy if the probability of success is a and the probability of failure is b.


Essential Information Theory

Developed by Shannon in the 1940s.

Notion of entropy (information content): how informative is a piece of information?
  If you already have a good idea about the answer (low entropy), then a "hint" about the right answer is not very informative…
  If you have no idea about the answer (ex. a 50-50 split; high entropy), then a "hint" about the right answer is very informative…


Entropy

Let X be a discrete random variable (RV).

Entropy (or information content):

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

Entropy:
  measures the amount of information in an RV: the average uncertainty of an RV
  is the average length of the message needed to transmit an outcome x_i of that variable
  is measured in bits
  for only 2 outcomes x_1 and x_2: 0 ≤ H(X) ≤ 1
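A small Python sketch of this definition, estimating H from a list of observed outcomes; the function name and representation are illustrative assumptions, and the printed values match the coin-toss example above.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum_i p(x_i) * log2 p(x_i), with probabilities estimated
    from the relative frequencies of the outcomes in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["heads", "tails"]))        # fair coin: 1.0 bit
print(entropy(["yes", "yes", "no"]))      # 2/3 vs 1/3 split: ~0.918 bits
print(entropy(["yes", "yes", "yes"]))     # no uncertainty: 0.0 bits
```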


Choosing the Best Feature (con't)

The "discriminating power" of an attribute A given a data set S:
  Let Values(A) = the set of values that attribute A can take.
  Let S_v = the set of examples in the data set which have value v for attribute A (for each value v from Values(A)).

Information gain (or entropy reduction):

$$gain(S, A) = H(S) - H(S \mid A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, H(S_v)$$
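The same formula as a Python sketch, reusing the `entropy` helper from the previous snippet and the dict-based example representation assumed earlier; nothing here is prescribed by the slides.

```python
def information_gain(examples, labels, attribute):
    """gain(S, A) = H(S) - sum over v in Values(A) of |S_v|/|S| * H(S_v)."""
    total = len(examples)
    remainder = 0.0
    for v in {ex[attribute] for ex in examples}:              # Values(A)
        s_v = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        remainder += (len(s_v) / total) * entropy(s_v)        # |S_v|/|S| * H(S_v)
    return entropy(labels) - remainder
```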


Some Intuition

  Size   Color  Shape   Output
  Big    Red    Circle  +
  Small  Red    Circle  +
  Small  Red    Square  -
  Big    Blue   Circle  -

Size is the least discriminating attribute (i.e. smallest information gain).
Shape and color are the most discriminating attributes (i.e. highest information gain).


A Small Example (1)

  Size   Color  Shape   Output
  Big    Red    Circle  +
  Small  Red    Circle  +
  Small  Red    Square  -
  Big    Blue   Circle  -

Color: red: 2+ 1-, blue: 0+ 1-.  Values(Color) = {red, blue}

$$H(S) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$

$$gain(S, Color) = H(S) - \sum_{v \in Values(Color)} \frac{|S_v|}{|S|}\, H(S_v)$$

$$H(S \mid Color{=}red) = H\!\left(\tfrac{2}{3}, \tfrac{1}{3}\right) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) = 0.918$$

$$H(S \mid Color{=}blue) = H(0, 1) = -\left(0\log_2 0 + 1\log_2 1\right) = 0$$

$$H(S \mid Color) = \tfrac{3}{4}(0.918) + \tfrac{1}{4}(0) = 0.6885$$

$$gain(Color) = H(S) - H(S \mid Color) = 1 - 0.6885 = 0.3115$$


A Small Example (2)

Note: by definition, log 0 = -∞, and 0 log 0 is taken to be 0.

Shape: circle: 2+ 1-, square: 0+ 1-

$$H(S) = -\left(\tfrac{2}{4}\log_2\tfrac{2}{4} + \tfrac{2}{4}\log_2\tfrac{2}{4}\right) = 1$$

$$H(S \mid Shape) = \tfrac{3}{4}(0.918) + \tfrac{1}{4}(0) = 0.6885$$

$$gain(Shape) = H(S) - H(S \mid Shape) = 1 - 0.6885 = 0.3115$$


A Small Example (3)

Size: big: 1+ 1-, small: 1+ 1-

$$H(S) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$

$$H(S \mid Size) = \tfrac{1}{2}(1) + \tfrac{1}{2}(1) = 1$$

$$gain(Size) = H(S) - H(S \mid Size) = 1 - 1 = 0$$


A Small Example (4)

gain(Shape) = 0.3115
gain(Color) = 0.3115
gain(Size) = 0

So first separate according to either Color or Shape (the root of the tree).
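As a quick numeric check of this worked example (purely illustrative, and assuming the `entropy` and `information_gain` sketches above are in scope):

```python
# The four training examples from the slides, in the dict representation
# assumed by the earlier sketches.
examples = [
    {"size": "big",   "color": "red",  "shape": "circle"},   # +
    {"size": "small", "color": "red",  "shape": "circle"},   # +
    {"size": "small", "color": "red",  "shape": "square"},   # -
    {"size": "big",   "color": "blue", "shape": "circle"},   # -
]
labels = ["+", "+", "-", "-"]

for attr in ("size", "color", "shape"):
    print(attr, round(information_gain(examples, labels, attr), 4))
# size 0.0, color 0.3113, shape 0.3113
# (the slides' 0.3115 comes from rounding H(2/3, 1/3) to 0.918; the exact value is 0.9183)
```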


The Restaurant Example

$$gain(Patrons) = 1 - \left[\tfrac{2}{12}\, H\!\left(\tfrac{0}{2},\tfrac{2}{2}\right) + \tfrac{4}{12}\, H\!\left(\tfrac{4}{4},\tfrac{0}{4}\right) + \tfrac{6}{12}\, H\!\left(\tfrac{2}{6},\tfrac{4}{6}\right)\right] = \ldots \approx 0.541 \text{ bits}$$

$$gain(Type) = 1 - \left[\tfrac{2}{12}\, H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{2}{12}\, H\!\left(\tfrac{1}{2},\tfrac{1}{2}\right) + \tfrac{4}{12}\, H\!\left(\tfrac{2}{4},\tfrac{2}{4}\right) + \tfrac{4}{12}\, H\!\left(\tfrac{2}{4},\tfrac{2}{4}\right)\right] = 0 \text{ bits}$$

gain(alt) = …   gain(bar) = …   gain(fri) = …   gain(hun) = …
gain(price) = …   gain(rain) = …   gain(res) = …   gain(est) = …

Attribute pat (Patrons) has the highest gain, so the root of the tree should be attribute Patrons; do the same recursively for the subtrees.
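A quick check of gain(Patrons) in Python (illustrative only; it uses the per-value class counts shown in the formula above, and assumes the 12 training examples split 6 positive / 6 negative so that H(S) = 1 bit, which is what the leading "1 −" implies).

```python
import math

def H(*probs):
    """Entropy of a distribution given as probabilities (0 log 0 treated as 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Patrons splits the 12 examples into None: 0+/2-, Some: 4+/0-, Full: 2+/4-.
gain_patrons = H(6/12, 6/12) - (2/12 * H(0/2, 2/2)
                                + 4/12 * H(4/4, 0/4)
                                + 6/12 * H(2/6, 4/6))
print(round(gain_patrons, 3))   # ~0.541
```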


Decision Boundaries of Decision Trees

[Figures: a 2-D feature space (Feature 1 vs Feature 2) is carved up by successive axis-parallel threshold tests (e.g. Feature 1 > t1, Feature 2 > t2, Feature 2 > t3); each test in the tree adds another rectangular decision region.]


Applications of Decision Trees

One of the most widely used learning methods in practice. Fast, simple, and traceable.

Can out-perform human experts on many problems:
  A study on diagnosing breast cancer had humans correctly classifying the examples 65% of the time; the decision tree classified 72% correctly.
  Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms


Evaluation Methodology

Standard methodology:
1. Collect a large set of examples (all with correct classifications).
2. Divide the collection into two disjoint sets: training set and test set.
3. Apply the learning algorithm to the training set. DO NOT LOOK AT THE TEST SET!
4. Measure performance with the test set: how well does the function you learned in step 3 correctly classify the examples in the test set? (i.e. compute accuracy)

Important: keep the training and test sets disjoint!
To study the robustness of the algorithm, repeat steps 2-4 for different training sets and sizes of training sets.
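A minimal sketch of steps 2 and 4 in Python (illustrative only; `predict` stands for whatever function the learning algorithm of step 3 produced, and the helper names are assumptions, not part of the slides).

```python
import random

def train_test_split(examples, labels, test_fraction=0.1, seed=0):
    """Step 2: shuffle the data and split it into disjoint training and test sets."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([examples[i] for i in train_idx], [labels[i] for i in train_idx],
            [examples[i] for i in test_idx], [labels[i] for i in test_idx])

def accuracy(predict, test_examples, test_labels):
    """Step 4: fraction of test examples the learned function classifies correctly."""
    correct = sum(predict(x) == y for x, y in zip(test_examples, test_labels))
    return correct / len(test_labels)
```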


Error analysis

Where did the learner go wrong? Use a confusion matrix / contingency table: rows are the correct class (the one that should have been assigned), columns are the classes assigned by the learner.

        C1     C2     C3     C4     C5     C6    …   Total
  C1    99.4   .3     0      0      .3     0         100
  C2    0      90.2   3.3    4.1    0      0         100
  C3    0      .1     93.9   1.8    .1     1.9       100
  C4    0      .5     2.2    95.5   .2     0         100
  C5    0      0      .3     1.4    96.0   2.5       100
  C6    0      0      1.9    0      3.4    93.3      100


A Learning Curve

Size of the training set: the more, the better; but after a while, not much improvement…

source: Mitchell (1997)


Some Words on Training

In all types of learning… watch out for:
  Noisy input
  Overfitting / underfitting the training data


Overfitting / Underfitting

In all types of learning… watch out for:

Overfitting the training data: if a large number of irrelevant features are present, we may find meaningless regularities in the data that are particular to the training data but irrelevant to the problem, i.e. you find a complicated model with many features while the real solution is much simpler.

Underfitting the training data: the training set is not representative enough of the real problem, i.e. you find a simple explanation with few features, while the real solution is more complex.


Cross-validation

Another method to reduce overfitting.

K-fold cross-validation: run k experiments, each time testing on 1/k of the data and training on the rest, then average the results.

Ex: 10-fold cross-validation
1. Collect a large set of examples (all with correct classifications).
2. Divide the collection into two disjoint sets: training (90%) and test (10% = 1/k).
3. Apply the learning algorithm to the training set.
4. Measure performance with the test set.
5. Repeat steps 2-4 with the 10 different portions.
6. Average the results of the 10 experiments.

  exp1: test  train train train …
  exp2: train test  train train …
  exp3: train train test  train …
  …
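A minimal sketch of this loop in Python (illustrative; `learn` and `predict` stand for an arbitrary learning algorithm and the function it returns, and the `accuracy` helper from the evaluation sketch above is assumed to be in scope).

```python
import random

def cross_validation(examples, labels, learn, k=10, seed=0):
    """Run k experiments: train on (k-1)/k of the data, test on the remaining 1/k,
    and return the average accuracy over the k folds."""
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k disjoint portions
    scores = []
    for test_idx in folds:
        test_set = set(test_idx)
        train_idx = [i for i in idx if i not in test_set]
        predict = learn([examples[i] for i in train_idx],
                        [labels[i] for i in train_idx])
        scores.append(accuracy(predict,
                               [examples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return sum(scores) / k
```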


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms


Unsupervised Learning

Learn without labeled examples, i.e. X is given, but not f(X).
Without an f(X), you can't really identify/label a test instance:
  small nose, big teeth, small eyes, moustache --> f(X) = ?

But you can:
  cluster/group the features of the test data into a number of groups
  discriminate between these groups without actually labeling them


Clustering

Represent each instance as a vector. Each vector can be visually represented in an n-dimensional space.

        a1   a2   a3   Output
  X1    1    0    0    ?
  X2    1    6    0    ?
  X3    8    0    1    ?
  X4    6    1    0    ?
  X5    1    7    1    ?


Clustering

Clustering algorithm:
  Represent the test instances in an n-dimensional space
  Partition them into regions of high density. How? … many algorithms (ex. k-means)
  Compute the centroïd of each region as the average of the data points in the cluster


k-means Clustering

The user selects how many clusters they want… (the value of k)

1. Place k points into the space (e.g. at random). These points represent the initial group centroïds.
2. Assign each data point to the group whose centroïd is closest.
3. When all data points have been assigned, recalculate the positions of the k centroïds as the average of each cluster.
4. Repeat steps 2 and 3 until none of the data instances change group.
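A minimal Python sketch of these four steps (illustrative only; points are assumed to be equal-length tuples of numbers, the Euclidean distance of the next slide is used as the metric, and the stopping test checks that the centroïds no longer move, which corresponds to no instance changing group).

```python
import math
import random

def euclidean(p, q):
    """Distance between two points p and q: sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def k_means(points, k, seed=0):
    """The four steps above; `points` is a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. initial centroids
    while True:
        # 2. assign each point to the group with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # 3. recompute each centroid as the average of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster)
                                           for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])    # keep an empty cluster's centroid
        # 4. stop when nothing changes (centroids are stable)
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

# e.g. on the vectors X1…X5 from the Clustering slide:
centroids, clusters = k_means([(1, 0, 0), (1, 6, 0), (8, 0, 1), (6, 1, 0), (1, 7, 1)], k=2)
```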


Euclidean Distance

To find the nearest centroïd… a possible metric is the Euclidean distance between 2 points p = (p1, p2, …, pn) and q = (q1, q2, …, qn):

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

Where to assign a data point x? For all k clusters, choose the one where x has the smallest distance.


Example

[Figures: k-means with k = 3 (centroïds c1, c2, c3) on a small 2-D data set:
  partition the data points to the closest centroïd;
  re-compute the new centroïds;
  re-assign the data points to the new closest centroïds;
  … until the assignments no longer change.]


Notes on k-means

Converges very fast!
BUT: very sensitive to the initial choice of centroïds
…
The user must set the initial k: not easy to do…
Many other clustering algorithms exist…


Today

Last time: Probabilistic Methods
Decision Trees
(Evaluation, Unsupervised Learning)
Next time: Neural Networks
Next time: Genetic Algorithms