Post on 19-Dec-2015

Russell Stuart, Norvig Peter, Artificial Intelligence: A Modern Approach, 1995

18 LEARNING FROM OBSERVATIONS

A learning agent can be divided into four conceptual components:

• The most important distinction is between the learning element, which is responsible for making improvements, and the performance element, which is responsible for selecting external actions.

• The critic is designed to tell the learning element how well the agent is doing.

• The problem generator is responsible for suggesting actions that will lead to new and informative experiences.
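The four components above can be sketched as a simple agent skeleton. This is an illustrative reconstruction, not code from the book; the class and method names are invented for the sketch.

```python
# Hypothetical skeleton of a learning agent with the four components
# named in the text. Each component is passed in as a plain function.

class LearningAgent:
    def __init__(self, performance_element, learning_element, critic,
                 problem_generator):
        self.performance_element = performance_element  # selects external actions
        self.learning_element = learning_element        # makes improvements
        self.critic = critic                            # reports how well the agent is doing
        self.problem_generator = problem_generator      # suggests informative new actions

    def step(self, percept):
        # the critic evaluates the percept; the learning element uses that
        # feedback to improve the performance element
        feedback = self.critic(percept)
        self.learning_element(feedback, self.performance_element)
        # the problem generator may override with an exploratory action
        exploratory = self.problem_generator(percept)
        if exploratory is not None:
            return exploratory
        return self.performance_element(percept)
```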

Figure 18.1 A general model of learning agents.

The design of the learning element is affected by four major issues:

• Which components of the performance element are to be improved.

• What representation is used for those components.

• What feedback is available.

• What prior information is available.

Representation of the components

• Any of these components can be represented using any of the representation schemes in this book. We have seen several examples: deterministic descriptions such as linear weighted polynomials for utility functions in game-playing programs, and propositional and first-order logical sentences for all of the components in a logical agent; and probabilistic descriptions such as belief networks for the inferential components of a decision-theoretic agent. Effective learning algorithms have been devised for all of these. The details of the learning algorithm will be different for each representation, but the main idea remains the same.

Available feedback

• supervised learning

• reinforcement learning

• unsupervised learning

Bringing it all together

Each of the seven components of the performance element can be described mathematically as a function:

• Information about the way the world evolves can be described as a function from a world state (the current state) to a world state (the next state or states).

• A goal can be described as a function from a state to a Boolean value (0 or 1) indicating whether the state satisfies the goal.

The key point is that all learning can be seen as learning the representation of a function.

• We can choose which component of the performance element to improve and how it is to be represented.
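The functional view above can be made concrete in a few lines. The successor function and goal test below are toy stand-ins, not anything from the book.

```python
# Two performance-element components written as functions, as the text
# suggests: a world model (state -> next state) and a goal test
# (state -> Boolean). The counter domain is purely illustrative.

def world_evolution(state):
    # "the way the world evolves": current state to next state
    return state + 1

def goal(state):
    # goal test: does the state satisfy the goal? (here: reach 3)
    return state >= 3

state = 0
while not goal(state):
    state = world_evolution(state)
# the loop halts in the first state satisfying the goal
```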

18.2 INDUCTIVE LEARNING

• In supervised learning, the learning element is given the correct (or approximately correct) value of the function for particular inputs, and changes its representation of the function to try to match the information provided by the feedback.

• An example is a pair (x,f(x)), where x is the input and f(x) is the output of the function applied to x.

• The task of pure inductive inference (or induction): given a collection of examples of f, return a function h that approximates f. The function h is called a hypothesis.
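Pure induction can be sketched as a search for a consistent hypothesis. The tiny hand-picked hypothesis space below is illustrative only.

```python
# Given examples (x, f(x)), return the first hypothesis h in a (toy)
# hypothesis space that agrees with f on every example.

examples = [(0, 0), (1, 1), (2, 4), (3, 9)]  # drawn from f(x) = x**2

hypothesis_space = [
    lambda x: x,        # identity
    lambda x: 2 * x,    # doubling
    lambda x: x ** 2,   # squaring
]

def induce(examples, hypotheses):
    """Return a hypothesis consistent with all examples, or None."""
    for h in hypotheses:
        if all(h(x) == y for x, y in examples):
            return h
    return None

h = induce(examples, hypothesis_space)  # h approximates (here, equals) f
```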

Figure 18.2 In (a) we have some example (input,output) pairs.

In (b), (c), and (d) we have three hypotheses for functions from which these examples could be drawn.

18.3 LEARNING DECISION TREES

• Decision tree induction is one of the simplest and yet most successful forms of learning algorithm. It serves as a good introduction to the area of inductive learning, and is easy to implement.

• We first describe the performance element, and then show how to learn it. Along the way, we will introduce many of the ideas and terms that appear in all areas of inductive learning.

Decision trees as performance elements

• A decision tree takes as input an object or situation described by a set of properties, and outputs a yes/no "decision." Decision trees therefore represent Boolean functions. Functions with a larger range of outputs can also be represented, but for simplicity we will usually stick to the Boolean case. Each internal node in the tree corresponds to a test of the value of one of the properties, and the branches from the node are labelled with the possible values of the test. Each leaf node in the tree specifies the Boolean value to be returned if that leaf is reached.
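The description above maps directly onto a nested data structure. The fragment below is a simplified piece of the restaurant tree; the exact branches are illustrative, not a faithful copy of Figure 18.4.

```python
# A decision tree as a nested structure: an internal node is a pair
# (attribute, branches), a leaf is the Boolean decision itself.

tree = ("Patrons", {
    "None": False,
    "Some": True,
    "Full": ("Hungry", {      # a second test on another property
        "Yes": True,          # simplified: the full tree tests further
        "No": False,
    }),
})

def decide(node, example):
    """Walk the tree, testing one property per internal node."""
    if isinstance(node, bool):     # leaf: return the Boolean value
        return node
    attribute, branches = node
    return decide(branches[example[attribute]], example)
```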

• the goal predicate (goal concept)

Expressiveness of decision trees

• If decision trees correspond to sets of implication sentences, a natural question is whether they can represent any set.

• Decision trees are fully expressive within the class of propositional languages, that is, any Boolean function can be written as a decision tree.

For some kinds of functions, however, this expressiveness comes at a price: the tree may have to be exponentially large. Examples include the:

• parity function, which returns 1 if and only if an even number of inputs are 1;

• majority function, which returns 1 if more than half of its inputs are 1.

Figure 18.4 A decision tree for deciding whether to wait for a table.

Inducing decision trees from examples

• An example is described by the values of the attributes and the value of the goal predicate. We call the value of the goal predicate the classification of the example. If the goal predicate is true for some example, we call it a positive example; otherwise we call it a negative example. A set of examples X1,…, X12 for the restaurant domain is shown in Figure 18.5. The positive examples are ones where the goal WillWait is true (X1, X3,....) and negative examples are ones where it is false (X2, X5, ...).

• The complete set of examples is called the training set.

Figure 18.5 Examples for the restaurant domain.

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait (Goal)
X1      | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | Yes
X2      | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | No
X3      | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | Yes
X4      | Yes | No  | Yes | Yes | Full | $     | No   | No  | Thai    | 10-30 | Yes
X5      | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | No
X6      | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | Yes
X7      | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | No
X8      | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | Yes
X9      | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | No
X10     | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | No
X11     | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | No
X12     | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | Yes

Figure 18.6 Splitting the examples by testing on attributes.

In (a) Patrons is a good attribute to test first;

in (b) Type is a poor one; and

in (c) Hungry is a fairly good second test, given that Patrons is the first test.

Figure 18.8 The decision tree induced from the 12-example training set.

Assessing the performance of the learning algorithm

• A learning algorithm is good if it produces hypotheses that do a good job of predicting the classifications of unseen examples.

1. how prediction quality can be estimated in advance.

2. a methodology for assessing prediction quality after the fact.

• Obviously, a prediction is good if it turns out to be true, so we can assess the quality of a hypothesis by checking its predictions against the correct classification once we know it. We do this on a set of examples known as the test set.

If we train on all our available examples, then we will have to go out and get some more to test on, so often it is more convenient to adopt the following methodology:

1. Collect a large set of examples.

2. Divide it into two disjoint sets: the training set and the test set.

3. Use the learning algorithm with the training set as examples to generate a hypothesis H.

4. Measure the percentage of examples in the test set that are correctly classified by H.

5. Repeat steps 1 to 4 for different sizes of training sets and different randomly selected training sets of each size.
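The five steps can be sketched with a stand-in learner; the majority-class learner and the synthetic examples below are placeholders for any real algorithm and data set.

```python
import random

def majority_learner(training_set):
    """A trivial learner: always predict the most common class."""
    labels = [label for _, label in training_set]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def accuracy(hypothesis, test_set):
    """Step 4: fraction of test examples classified correctly."""
    correct = sum(1 for x, label in test_set if hypothesis(x) == label)
    return correct / len(test_set)

# 1. collect a large set of examples (synthetic here)
examples = [(i, i % 3 != 0) for i in range(100)]

# 2. divide into two disjoint sets
random.seed(0)
random.shuffle(examples)
training_set, test_set = examples[:70], examples[70:]

# 3. generate a hypothesis H from the training set
H = majority_learner(training_set)

# 4. measure classification accuracy on the test set
score = accuracy(H, test_set)
```

Step 5 would wrap steps 2 to 4 in a loop over training-set sizes to produce a learning curve.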

18.4 USING INFORMATION THEORY

• Information theory uses this same intuition, but instead of measuring the value of information in dollars, it measures information content in bits. One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a fair coin.

• The information gain from the attribute test is defined as the difference between the original information requirement and the new requirement:

• Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)

• Remainder(A) = Σ_i ((p_i + n_i)/(p + n)) · I(p_i/(p_i + n_i), n_i/(p_i + n_i))

and the heuristic used in the CHOOSE-ATTRIBUTE function is just to choose the attribute with the largest gain.
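The two formulas can be checked numerically. The split counts below are for the Patrons attribute of the 12-example restaurant set (None: 0+/2-, Some: 4+/0-, Full: 2+/4-, as in Figure 18.6).

```python
from math import log2

def I(*probabilities):
    """Information content of a distribution, in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def remainder(splits, p, n):
    """Information still needed after testing the attribute."""
    return sum((pi + ni) / (p + n) * I(pi / (pi + ni), ni / (pi + ni))
               for pi, ni in splits)

def gain(splits, p, n):
    """Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A)."""
    return I(p / (p + n), n / (p + n)) - remainder(splits, p, n)

patrons_splits = [(0, 2), (4, 0), (2, 4)]  # (p_i, n_i) for each branch
g = gain(patrons_splits, p=6, n=6)         # about 0.541 bits
```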

18.5 LEARNING GENERAL LOGICAL DESCRIPTIONS

• Inductive learning can be viewed as a process of searching for a good hypothesis in a large space - the hypothesis space - defined by the representation language chosen for the task.

• Each hypothesis proposes such an expression, which we call a candidate definition of the goal predicate.

• the extension of the predicate

• An example can be a false negative for the hypothesis, if the hypothesis says it should be negative but in fact it is positive.

• An example can be a false positive for the hypothesis, if the hypothesis says it should be positive but in fact it is negative.

• The hypothesis and the example are therefore logically inconsistent.
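The two error types can be expressed as a small consistency check; the threshold hypothesis below is a toy candidate definition invented for the sketch.

```python
def check(hypothesis, example, actual):
    """Return how the hypothesis relates to one classified example."""
    predicted = hypothesis(example)
    if predicted and not actual:
        return "false positive"   # hypothesis says positive, example is negative
    if not predicted and actual:
        return "false negative"   # hypothesis says negative, example is positive
    return "consistent"

h = lambda x: x > 5   # toy candidate definition of the goal predicate
```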

Figure 18.10 (a) A consistent hypothesis. (b) A false negative. (c) The hypothesis is generalized. (d) A false positive. (e) The hypothesis is specialized.

Current-best-hypothesis search (generalization, specialization)

Least-commitment search

• The set of hypotheses remaining is called the version space, and

• the learning algorithm is called the version space learning algorithm (also the candidate elimination algorithm).

• it is incremental

• It is also a least-commitment algorithm because it makes no arbitrary choices
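The least-commitment idea can be sketched by brute force for a toy hypothesis space: keep every hypothesis consistent with the examples seen so far, eliminating candidates incrementally. Real version-space algorithms represent the space by its boundary sets G and S instead of enumerating it.

```python
def eliminate(version_space, example, actual):
    """Incremental update: drop hypotheses inconsistent with one example."""
    return [h for h in version_space if h(example) == actual]

# toy hypothesis space: thresholds "x >= t" for t = 0..9
version_space = [lambda x, t=t: x >= t for t in range(10)]

for example, actual in [(7, True), (3, False)]:
    version_space = eliminate(version_space, example, actual)
# survivors are exactly the thresholds with 3 < t <= 7
```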

Figure 18.13 The version space contains all hypotheses consistent with the examples.

Figure 18.14 The extensions of the members of G and S. No known examples lie in between.

18.6 WHY LEARNING WORKS: COMPUTATIONAL LEARNING THEORY

• Learning means behaving better as a result of experience.

• computational learning theory - a field at the intersection of AI and theoretical computer science.

• error(h) = P(h(x) ≠ f(x) | x drawn from D)

Figure 18.15 Schematic diagram of hypothesis space, showing the "ε-ball" around the true function f.

Figure 18.16 A decision list for the restaurant problem.

Learning decision lists

∀x WillWait(x) ⇔ Patrons(x, Some) ∨ (Patrons(x, Full) ∧ Fri/Sat(x))
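A decision list reads as an ordered sequence of tests: the first test that succeeds determines the output. A sketch, with attribute names adapted from the restaurant example:

```python
# A decision list as ordered (test, outcome) rules; "FriSat" stands in
# for the Fri/Sat attribute.

decision_list = [
    (lambda x: x["Patrons"] == "Some", True),
    (lambda x: x["Patrons"] == "Full" and x["FriSat"], True),
]

def will_wait(example, rules, default=False):
    """Return the outcome of the first rule whose test succeeds."""
    for test, outcome in rules:
        if test(example):
            return outcome
    return default
```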

Figure 18.18 Graph showing the predictive performance of the DECISION-LIST-LEARNING algorithm on the restaurant data, as a function of the number of examples seen. The curve for DECISION-TREE-LEARNING is shown for comparison

18.7 SUMMARY

All learning can be seen as learning a function, and in this chapter we concentrated on induction: learning a function from example input/output pairs. The main points were as follows:

• Learning in intelligent agents is essential for dealing with unknown environments (i.e., compensating for the designer's lack of omniscience about the agent's environment).

• Learning is also essential for building agents with a reasonable amount of effort (i.e., compensating for the designer's laziness, or lack of time).

• Learning agents can be divided conceptually into a performance element, which is responsible for selecting actions, and a learning element, which is responsible for modifying the performance element.

• Learning takes many forms, depending on the nature of the performance element, the available feedback, and the available knowledge.

• Learning any particular component of the performance element can be cast as a problem of learning an accurate representation of a function.

• Learning a function from examples of its inputs and outputs is called inductive learning.

• The difficulty of learning depends on the chosen representation. Functions can be represented by logical sentences, polynomials, belief networks, neural networks, and others.

• Decision trees are an efficient method for learning deterministic Boolean functions.

• Ockham's razor suggests choosing the simplest hypothesis that matches the observed examples. The information gain heuristic allows us to find a simple decision tree.

• The performance of inductive learning algorithms is measured by their learning curve, which shows the prediction accuracy as a function of the number of observed examples.

• We presented two general approaches for learning logical theories. The current-best-hypothesis approach maintains and adjusts a single hypothesis, whereas the version space approach maintains a representation of all consistent hypotheses. Both are vulnerable to noise in the training set.

• Computational learning theory analyses the sample complexity and computational complexity of inductive learning. There is a trade-off between the expressiveness of the hypothesis language and the ease of learning.