
STUART RUSSELL

INDUCTIVE LEARNING BY MACHINES*

(Received 1 May, 1990)

Philosophical Studies 64: 37--64, 1991. © 1991 Kluwer Academic Publishers. Printed in the Netherlands.

The field of machine learning is concerned with methods for the automatic improvement of the performance of computer programs. When those computer programs are knowledge-based systems, the machine learning problem becomes that of inductive theory formation. I describe the progress that has been made in solving this problem, yielding algorithms that are capable in principle of learning a very large class of theories. I argue that the key to further progress lies in the effective use of prior knowledge to reduce the complexity of the search for useful new hypotheses. The implications of an algorithmic perspective for some of the debates that arise in the philosophy of science seem to be quite significant.

1. INTRODUCTION

Machine learning researchers aim to achieve what most lay persons find hardest to accept about artificial intelligence: that machines actually improve their behaviour with experience. The idea of going beyond the codified capabilities of the programmer makes the field both exciting and puzzling -- puzzling because one naturally wants to know where some new skill comes from, if it was not provided by the machine's creator. The growth of knowledge arises, of course, from interaction with a contingent environment. The ultimate goal of machine learning is to understand and implement algorithmic methods for improving a system's performance towards the theoretical maximum allowed by its resource limitations, whatever the environment in which the system finds itself.

What a machine does with its experience depends on the nature of its performance element, the mechanism for selecting actions. Most people in artificial intelligence (AI) propose performance elements based on explicitly-represented knowledge -- so-called knowledge-based systems. These programs reason about the state of the world, and about the potential results of actions, using sentences in some formal language such as first-order predicate calculus. The learning problem therefore involves creating and revising formally-represented theories using a stream of concrete observations. This activity corresponds well to the idealization of scientific research typically studied by analytic philosophers of science, although its domain of application is broader.

Machine learning therefore involves a broad spectrum of philosophical, mathematical and technical problems. Designing or evaluating a computer program for theory formation would at first sight seem to entail adopting a rationalist perspective on the processes of theory formation and comparison, since one hopes to be able to give reasons why one program is better than another. Furthermore, one has to tackle head-on the problem of generating new theories, despite Newton-Smith's assertion ([Newton-Smith, 1981], p. 125) that "most philosophers hold that . . . there is no systematic, useful study of theory construction or discovery". Machine learning programs have actually been doing some of the things that scientists do. Buchanan and Mitchell [Buchanan and Mitchell, 1978] wrote a program called Meta-DENDRAL that successfully induced molecular cleavage rules, subsequently published in a journal of chemistry, from raw data taken from a mass spectroscope. While the basic inductive framework, derived as it was from Buchanan's background as a philosopher, proved adequate to accomplish a real, if very 'normal', piece of science, it was clear that quite a gap remained between what was formalized in the framework and what had to be done to get the system to produce reasonable hypotheses. Recent developments, described below, have begun to bridge this gap.

The body of the paper is divided into three main sections. Section 2 describes a simple learning scenario that includes the aspects of theory formation we would like to study, and goes on to caricature the historical developments in machine learning, as the formal problems considered gradually increased in scope. The problem of computational complexity is seen to be central. Section 3 shows how prior knowledge can come to the aid of a learning system in understanding new observations, and describes some recent knowledge-based learning methods.


Figure 1. Observer's view of the learning scenario

Section 4 attempts to connect the foundational debates in the philosophy of science, such as I have been able to discern, to basic issues in machine learning. I conclude with an evaluation of prospects for research on questions of common interest to both the machine learning and the philosophical communities.

The paper is intended primarily for a philosophical audience, although one can expect the philosophical portion to be naive and the machine learning material to be fiendishly technical. I assume some familiarity with formal logic, and more familiarity with the structure of intelligent systems than I should.

2. MACHINE LEARNING: A SCENARIO AND A BRIEF SUMMARY

To provide an illustrative context, I'll introduce a simple learning scenario from the RALPH (Rational Agents with Limited Performance Hardware) project at Berkeley. The system consists of a simulated world with its own "laws of physics", and a number of "agents" (or ralphs) that inhabit the world. These agents are provided with the appropriate sensory inputs by the simulator, and operate in a simple "sense-act" cycle, selecting actions that are carried out on the world by the simulator. Figure 1 shows a typical situation, in which one agent is trying to find food while others aim to chase down and inflict injury upon him. From the agent's point of view, as opposed to the observer's, the epistemic situation is greatly impoverished, and is well described by what Rivest [Rivest and Schapire, 1987] has called the "buttons and lights" model: the sensory inputs consist of a row of (uninterpreted) lights that can be on or off, and the available actions are represented by a row of (uninterpreted) buttons to push. 1 In RALPH worlds, the lights usually represent portions of a "visual field" 90 degrees wide in front of the agent, and can have a scalar intensity and a discrete colour. The visual array is of course one-dimensional since the agent is embedded in a two-dimensional world. Actions usually include moving forward, turning left or right, picking up an object, eating it, and so on. We extend Rivest's model in one important direction by adding an interpreted utility input -- interpreted in the sense that the agents are designed with the aim of maximizing the average value of the input. For such an agent, the generation of true theories of the world is (perhaps) a subgoal of utility maximization. The agent is not designed as a purely disinterested seeker after truth.

Some methodological aspects of such a setup are worth mentioning. First, it is designed for easy replacement of the "laws of physics" and the agents' sensorimotor equipment. For example, we could add non-determinism to the behaviour of the chasing agents or to the effects of eating various objects; we could give agents a sense of smell -- an input scalar proportional to the number of smell-producing objects in the vicinity; or we could have the visual field respond to non-contiguous parts of the environment. This enables us to separate world-specific from world-independent aspects of learning behaviour, and to demonstrate the generality of the programs we design. Second, we are forced to model, even if in caricature, the entire process of theory formation, rather than studying, for example, the comparison of theories ex machina on the basis of isolated, sanitized, abstracted and verbalized "observations". Furthermore, it is not given in advance what the agent is to form a theory about. In fact, these learning goals will come about from "research requests" produced in the process of action selection. For example, the agent will need to know what happens when a certain button is pushed, so to speak. Such "situatedness" has important consequences for learning, since it means, among other things, that the agent will seldom want to learn a predictive law for a predicate about which it has no other knowledge. Unfortunately, many studies in machine learning and inductive logic have assumed the opposite.

Be that as it may, I'll start with a description of early concept-learning programs. The concept-learning paradigm assumes a subject presented with descriptions of instances accompanied by their classification according to membership of an unknown concept. The concept is unknown in the sense that the subject initially has no way to predict concept membership from instance descriptions alone; of course, whether the presentation of examples is done by the natural environment or by a teacher, the subject has a definition for the concept in that the classifications of the examples can be perceived. The subject's task is to induce an operational definition for the concept, one that can be used successfully to predict membership for new instances. In our scenario, one useful concept to learn might be what the observer would call "situations in which the agent will be bitten by an enemy". The need to learn this concept arises when the agent notices sudden, unexplained reductions in its utility.

The first problem is to understand what might constitute appropriate instance descriptions. Although it's reasonable to make the metaphysical assumption that each state transition of the world itself is a (perhaps nondeterministic) function of the previous state, the agent's sensory input is only a partial projection of the world state, so to ensure that a correct law can be learnt the instance descriptions should include as much of the agent's observational history as possible, up to the point at which the biting takes place. Even then, this might not be enough; in fact, an agent will get bitten when an enemy is adjacent to and facing the agent, but the agent's sensory input only registers the presence of the enemy, not its orientation. 2

The consequence of generosity in including information in instance descriptions is an unacceptably slow rate of learning. A concept learning program can be viewed as searching a hypothesis space of possible definitions for one that is consistent with all the observations. These hypotheses are, at least in simple systems, built out of elements in the instance descriptions. When there are many possible elements to consider, the number of possible hypotheses becomes enormous. For example, with instances described by just five Boolean (yes/no) attributes, more than four billion distinct hypotheses can be proposed. The response of the machine learning community prior to the early '80s was to take a pragmatic line. Each machine learning experiment was carried out in a specific context, where the programmer could identify the relevant features and present the machine with instance descriptions perfectly tailored for learning the concept the programmer had in mind. In our case, for example, we could restrict the attention of the learning program to five computed features -- whether or not an enemy is in each of the four adjacent squares, and whether or not the agent decides to move at this juncture. The Meta-DENDRAL program mentioned above took as its instance descriptions just the topological structural formula of the molecules in question, ignoring isotope and other information, as well as making the grand assumption that properties of the molecule itself accounted for its cleavage processes.

As well as restricting the instance descriptions, the programmer can restrict the language used to express concept definitions. For example, by allowing only conjunctive definitions, the five-feature language drops from four billion concepts to just 243. 3 Judicious use of such syntactic preferences seemed to be a sine qua non of inductive success, and characterized many of the research efforts in machine learning in the 1970's. Even with syntactic and descriptive restrictions, however, the search problems in learning can be extremely complicated, and several ingenious algorithms were devised [Mitchell, 1982].
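The arithmetic behind these numbers is easy to reproduce. The sketch below (illustrative only; the feature names and labelled examples are hypothetical) enumerates the 3^5 = 243 conjunctive hypotheses over five Boolean attributes, compares that with the 2^(2^5) -- more than four billion -- unrestricted Boolean concepts, and filters the conjunctive space for consistency with a few observations.

```python
from itertools import product

N_ATTRS = 5  # e.g. enemy-north, enemy-south, enemy-east, enemy-west, agent-moves (hypothetical)

# Unrestricted hypothesis space: every Boolean function of 5 attributes.
n_unrestricted = 2 ** (2 ** N_ATTRS)          # 4,294,967,296 -- "more than four billion"

# Conjunctive hypotheses: each attribute is required true, required false, or ignored.
REQ_TRUE, REQ_FALSE, IGNORE = 1, 0, None
conjunctive_space = list(product((REQ_TRUE, REQ_FALSE, IGNORE), repeat=N_ATTRS))
assert len(conjunctive_space) == 243          # 3^5

def matches(hypothesis, instance):
    """A conjunctive hypothesis covers an instance iff every non-ignored
    attribute has the required value."""
    return all(req is None or req == val for req, val in zip(hypothesis, instance))

def consistent(hypothesis, examples):
    """Consistent = covers every positive example and no negative one."""
    return all(matches(hypothesis, x) == label for x, label in examples)

# Hypothetical labelled observations: (attribute vector, bitten?).
examples = [((1, 0, 0, 0, 0), True),    # enemy to the north, agent stays: bitten
            ((1, 0, 0, 0, 1), False),   # enemy to the north, agent moves: not bitten
            ((0, 0, 0, 0, 0), False)]   # no enemy at all: not bitten

survivors = [h for h in conjunctive_space if consistent(h, examples)]
print(n_unrestricted, len(conjunctive_space), len(survivors))
```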

Theoretical studies of learning in computer science were limited, before 1984, to analysis of the possibility of converging on a correct hypothesis given sufficient data. "Identification in the limit" [Gold, 1967, Blum and Blum, 1975] was the criterion for success, which meant that the program would propose a current-best-hypothesis after each observation, and after some finite number of observations would produce, and stick with, a correct theory. Of course, as many have pointed out, the program will never know it has a true theory. Learning in this sense requires that the language be recursively enumerable (that is, a program can enumerate all expressions in the language) and have the property that consistency between hypotheses and observations be decidable. Given these conditions, a 'Popperian' program can enumerate the possible hypotheses one by one, adopting the first hypothesis it finds that is consistent with the observed instances, and then continuing the enumeration when that hypothesis is refuted by subsequent observation. As a theoretical model of science, this has its attractions, but mainly for mathematicians.
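A toy rendering of such a 'Popperian' enumerate-and-refute learner, for a deliberately trivial hypothesis language, might look as follows (the threshold concept below is my own illustrative choice, not an example from the literature):

```python
def popperian_learner(hypotheses, consistent, observation_stream):
    """Keep a current-best hypothesis: adopt the first enumerated hypothesis
    consistent with the data so far; when it is refuted, resume the enumeration."""
    data = []
    candidates = iter(hypotheses)
    current = next(candidates)
    for obs in observation_stream:
        data.append(obs)
        while not consistent(current, data):
            current = next(candidates)          # refuted: move on to the next candidate
        yield current                            # current-best hypothesis after each datum

# Toy instance: the unknown concept is "x >= 7" over the natural numbers;
# the hypothesis language is {"x >= k" : k = 0, 1, 2, ...}, enumerated in order.
def thresholds():
    k = 0
    while True:
        yield k
        k += 1

def consistent(k, data):
    return all((x >= k) == label for x, label in data)

stream = [(3, False), (9, True), (6, False), (7, True), (12, True)]
for h in popperian_learner(thresholds(), consistent, stream):
    print("current hypothesis: x >=", h)        # converges to 7 and sticks with it
```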

A more interesting model was proposed by Valiant [Valiant, 1984] in 1984. In his framework, a successful algorithm need not guarantee exact identification of the true theory, but need only produce a probably approximately correct (PAC) theory. A concept language is PAC-learnable only if an algorithm exists that requires a number of observed examples that is polynomial in the size of the instance descriptions, and requires only a polynomial number of computation steps to produce its hypothesis. The distinction between polynomial and exponential computations is a fundamental one: most problems can be solved in an uninteresting way by enumerating and testing all possible solutions, whose number will typically be exponential in the size of the problem description. PAC-learnability thus requires that the program be able to take significant advantage of the information in the observations to construct, in a sense, a consistent hypothesis. The distinction between constructive use of information and blind enumeration turns out to be crucial to the possibility of inductive learning.
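For a finite hypothesis space, the flavour of a PAC guarantee can be conveyed by the standard sample-complexity bound m >= (1/epsilon)(ln|H| + ln(1/delta)) for any consistent learner; the bound is textbook material rather than anything specific to Valiant's paper, and the numbers below are only illustrative.

```python
import math

def pac_sample_bound(hypothesis_space_size, epsilon, delta):
    """Standard bound for a consistent learner over a finite hypothesis space:
    with m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples, any hypothesis
    consistent with all of them has error <= epsilon with probability >= 1 - delta,
    under the fixed-distribution assumption discussed in the text."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1 / delta)) / epsilon)

# The 243 conjunctive hypotheses need few examples ...
print(pac_sample_bound(243, epsilon=0.1, delta=0.05))        # 85
# ... and because the bound grows only with ln|H|, even the unrestricted space of
# 2^32 Boolean functions needs only a few times more; the hard part is the
# *computation* needed to find a consistent hypothesis in so large a space.
print(pac_sample_bound(2 ** 32, epsilon=0.1, delta=0.05))    # 252
```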

The Valiant model and the field of computational learning theory it has inspired [Warmuth, 1989] contain several philosophically interesting ideas. The idea of approximate correctness of a theory has obvious connections to Popper's idea of verisimilitude [Popper, 1968a], which ran into serious difficulties [Miller, 1974, Tichy, 1974]. The problem with verisimilitude is that one is trying to establish closeness to truth with only observational success to go on. Future observations might bring up aspects of the world not previously exhibited, and no theory could expect to predict them. Valiant's move is to base an attribution of approximate correctness on the following assumption: that the future experiences that test the theory are drawn from the same probability distribution as was used to generate the examples from which the theory was induced. Such an assumption of non-malignity seems quite reasonable, and provides us with a good way to assess verisimilitude. A second interesting point is the emphasis on predictive success rather than on the identification of necessary and sufficient conditions in agreement with the "true theory". Valiant explicitly discusses the kind of learning by ostension proposed by Quine, and produces a model that the later Wittgenstein would find less objectionable than most. The final point, and perhaps the most important, is the emphasis on computational complexity. Although PAC-learning researchers have focused on discovering which classes of concepts are easy or hard to learn, the notion of computational complexity has significance for the more general project of finding a rational procedure for doing science. Philosophers of science have divided criteria for theory selection into internal and external factors. The former consist of the theories under consideration and the relevant evidence; the latter are therefore "psychological and social factors" [Newton-Smith, 1981]. Similarly, Suppes [Suppes, 1968] seems to argue that only psychological considerations can be cited against vacuous theories of identification by enumeration. What's missing is the middle ground: considerations of the intrinsic computational complexity of inductive inference are a valid additional factor in determining rational inductive behaviour, whether we are talking about ideal science, machine learning systems or flesh-and-blood scientists. Hypotheses that can only be produced by intractable procedures should not be sought if another hypothesis is immediately available and otherwise comparable in quality.

Despite several advantages, the PAC-learning model is still far from adequate. Guarantees on the probable approximate success of a learning algorithm rely on the assumption that the hypothesis space being searched contains a correct theory. From the agent's point of view, this condition is hard to fulfill unless all restrictions on the language and instance descriptions are removed, a move which gets us back in the complexity soup. Furthermore, there seems little chance of positive results for the more complex languages (such as first order predicate calculus) that are needed for non-trivial theories. The problem seems to be that learning from scratch is just too hard. An incremental approach might be more successful.

3. PRIOR KNOWLEDGE GETS INTO THE ACT

In all of the approaches described above, the idea is to construct a program that has the input/output behaviour observed in the data.


Logically speaking, this means constructing a theory New Knowledge that entails the observed facts: 4

(1) New Knowledge ⊨ Observations

In a sense, this can be seen as an 'equation' (the word is used loosely here) in which New Knowledge is the 'unknown'. The philosopher's induction problem can be seen as deciding if a given theory is a reasonable solution. We will call this form of the learning problem knowledge-free inductive learning (KFIL for short). Most machine learning systems prior to the early 1980's were in effect trying to solve instances of this problem; typically, ad hoc methods, including syntactic and vocabulary restrictions on theories, were used to make the problem tractable and to ensure the production of interesting explanatory hypotheses.

The picture that is currently fashionable in machine learning is that of an agent that already knows something and is trying to learn some more. This may not sound like a terrifically deep revolution on which to base a new science, but it makes quite a difference to the way in which we write programs. It might also have some relevance to both normative and descriptive studies of science. If there is to be a general explanatory account of learning, it must include a story about how the prior knowledge got there, to be used in the new learning episodes. The answer is that it got there by a learning process. This account is therefore a cumulative or incremental theory. Presumably an agent could start out with nothing, performing inductions in vacuo like a good little KFIL program. But once it has eaten from the Tree of Knowledge, it can no longer pursue such naive speculations, and should use its prior knowledge to learn more and more effectively. The question is then how to actually do this.

3.1. Some obvious examples

Many apparently rational cases of inferential behaviour in the face of observations clearly do not follow the simple "generalization from n instances" rule of the naive inductivist.

• Sometimes general conclusions are leapt to after only one observation. Gary Larson once drew a cartoon in which a bespectacled caveman is roasting his lizard on the end of a pointed stick, watched by an amazed crowd of his less intellectual contemporaries, who have been using their bare hands to hold their victuals over the fire. Clearly, this single enlightening experience is enough to convince the watchers of a general principle of painless cooking.

• Or consider the case of the traveller to Brazil meeting her first Brazilian. On hearing him speak Portuguese, she immediately concludes that Brazilians speak Portuguese, yet on discovering that his name is Fernando, she doesn't conclude that all Brazilians are called Fernando. 5

• Similar examples appear in science: when a freshman physics student measures the resistivity of a sample of copper, she is quite confident in generalizing that value to all such samples.

• Goodman's classic example of green/grue emeralds is another case in point, which he used in [Goodman, 1955] to refute the early claims of the confirmation theorists. In this case, both possible generalizations are supported by the same set of instances, yet one is clearly preferable. This differential behaviour cannot emerge from factors internal to the theories or to the observations, and therefore requires a revision to the simple syntactic model of enumerative induction.

• Finally, consider the case of a pharmacologically ignorant but diagnostically sophisticated medical student observing a consulting session between a patient and an expert internist. After a series of questions and answers the expert tells the patient to take a course of a particular antibiotic. The medical student infers the general rule that that particular antibiotic is effective for a particular type of infection.

3.2. Some general schemes

In each of the above examples, one can appeal to prior knowledge to try to justify the generalization chosen. Put simply, given a hypothesis and the direct supporting evidence, a learning agent has to decide what other facts might be relevant and whether or not those facts are the case. The nature of the relationship between hypothesis, evidence and prior knowledge must surely be syntactic, since the agent can operate only on syntactic representations, not meanings. For first-order logical theories, however, the semantic relation of logical entailment can be realized syntactically. Entailment relations will therefore constitute the organizing principles for understanding the role of prior knowledge.

In the case of lizard-toasting, the cavemen generalize by explaining the success of the pointed stick: it supports the lizard while keeping the hand intact. From this explanation they can infer a general rule: that any long, thin, rigid, sharp object can be used to toast small, soft-bodied edibles. This kind of generalization process has been automated as explanation-based learning (EBL) [De Jong, 1981, Mitchell et al., 1986]. The "entailment equation" satisfied by EBL is the following:

Prior Knowledge ⊨ New Knowledge ⊨ Observation

Since the second part of the equation looks the same as KFIL, EBL was initially thought to be a better way to do learning. But since it requires that the prior knowledge be sufficient to explain the observation, the agent using it doesn't actually learn anything factual from the instance [Dietterich, 1986]. EBL is now viewed as a method for converting first-principles theories into useful special-purpose knowledge; while this is an extremely important aspect of reasoning in science and engineering, it doesn't contribute to a theory of the generation of new knowledge.
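The knowledge-compilation role of EBL can be caricatured propositionally: compose the rules used in an explanation into one special-purpose macro rule. The sketch below does only that; the rule base is invented for the lizard-toasting example, and the variabilization step that makes real EBL rules general is omitted.

```python
# Propositional Horn rules, conclusion <- premises (invented for the example).
RULES = {
    "supports_food":       ["long", "rigid"],
    "keeps_hand_cool":     ["long"],
    "good_lizard_toaster": ["supports_food", "keeps_hand_cool", "sharp"],
}

def explain(goal, facts, used):
    """Backward-chain to explain `goal`; return the observable premises the
    explanation bottoms out in, recording which rules were used.
    (Assumes exactly one rule per conclusion, for simplicity.)"""
    if goal in facts:
        return {goal}
    used.append(goal)
    leaves = set()
    for premise in RULES[goal]:
        leaves |= explain(premise, facts, used)
    return leaves

# One observed episode: the pointed stick.
facts = {"long", "rigid", "sharp"}
used = []
leaves = explain("good_lizard_toaster", facts, used)

# The compiled "macro rule": observable conditions -> conclusion, usable in one step next time.
print(sorted(leaves), "->", "good_lizard_toaster")   # ['long', 'rigid', 'sharp'] -> good_lizard_toaster
```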

The situation of our traveller in Brazil is quite different. For she cannot necessarily explain why Fernando speaks the way he does, unless she knows her Papal bulls. But the same generalization would be forthcoming from a traveller entirely ignorant of colonial history. The freshman physics student would also be hard put to explain the particular value that she discovers for the resistivity of copper. In the traveller's case, the relevant prior knowledge is that, within any given country, most people tend to speak the same language; on the other hand, Fernando is not assumed to be the name of all Brazilians because this kind of regularity does not hold for names. Regularities of this type have been formalized by the author as determinations, relations between predicate schemata with a natural first-order semantics. 6


Determination is expressed by the symbol >, so the above regularity is written

Nationality(x, y) > Language(x, z)

The logical formalization naturally suggests a new entailment equation:

(2) Prior Knowledge + Observations ⊨ New Knowledge

Again, as with EBL systems, New Knowledge should explain the Observations. Systems satisfying this equation will be called knowledge-based deductive learners (KBDL for short).
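A toy illustration of how a determination licenses a single-instance generalization follows; the representation of determinations as pairs of predicate names is a deliberate simplification of the first-order semantics, and all of the data is hypothetical.

```python
# A determination "P(x, y) > Q(x, z)": the value of y fixes the value of z.
DETERMINATIONS = [("Nationality", "Language")]   # nationality determines language;
                                                 # there is no such determination for names.

def generalize(observation, determinations):
    """Given one observed individual, return the general rules licensed by the
    prior determinations -- and nothing else."""
    rules = []
    for lhs, rhs in determinations:
        if lhs in observation and rhs in observation:
            rules.append(f"forall x: {lhs}(x, {observation[lhs]}) -> {rhs}(x, {observation[rhs]})")
    return rules

fernando = {"Nationality": "Brazil", "Language": "Portuguese", "Name": "Fernando"}
print(generalize(fernando, DETERMINATIONS))
# ['forall x: Nationality(x, Brazil) -> Language(x, Portuguese)']
# No rule "all Brazilians are called Fernando" is produced, because Name is not
# determined by Nationality in the prior knowledge.
```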

Prior knowledge in the form of theories is itself derived from observations. So the role of an equation such as (2) is to locate those prior observations whose epistemological weight can be brought to bear on the induction from the current observation set. The prior observations are connected to the current case by the presence of certain kinds of regularities that are expressed as theories in Prior Knowledge. In this way we can begin to give structure and operational force to Quine and Ullian's "web of belief" [Quine and Ullian, 1970].

Goodman's informal theory of overhypotheses and entrenchment, designed to explain the difference in projectibility between "all emeralds are green" and "all emeralds are grue", also invokes prior knowledge:

While confirmation is indeed a relation between evidence and hypotheses, this does not mean that our definition of this relation must refer to nothing other than such evidence and hypotheses. The fact is that whenever we set about determining the validity of a given projection from a given base, we have and use a good deal of other relevant knowledge. ([Goodman, 1955] pp. 84--5).

I have shown in [Russell, 1986] that we can give suitable logical definitions to (relative) entrenchment and overhypotheses such that Goodman's patterns of reasoning are revealed as kinds of KBDL.

We have seen that prior knowledge can be used in KBDL to produce useful generalizations from observed instances. The generalizations follow deductively, yet the information in the observations enlarges the system's deductive closure. So does KBDL suffice to account for all reasonable learning behaviour? The very fact that it yields valid generalizations suggests otherwise. Using KBDL ab initio can only yield a knowledge state that is the deductive closure of all the agent's observations. And since observations are conjunctive ground sentences, their deductive closure is just the set of observations itself.


Clearly, we don't want to retreat to the KFIL definition. Somehow, the agent's prior knowledge needs to be in the equation, but we need New Knowledge to be on the left-hand side to allow for ampliative inference. We will therefore call a system a knowledge-based inductive learner (KBIL for short) if it satisfies the following entailment equation:

(3) Prior Knowledge + New Knowledge ⊨ Observations

In the case of the medical student, the inferred pharmacological rule is sufficient to explain the observations, given the student's prior knowledge that allows her to infer the infection from which the patient is suffering. The knowledge must also include norms of sincere communication and the ascription of goals and beliefs to the expert. This kind of reasoning, in which a small additional theory is proposed to complete an otherwise conclusive deductive explanation for a phenomenon, has often been called abductive. We will see below that the same entailment equation and preference criteria can be used to produce the kind of inductive theory formation typical of science. A key unifying principle is the preference for the simplest possible New Knowledge.
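The following sketch shows equation (3) in miniature: a tiny forward chainer over propositional Horn clauses tests whether Prior Knowledge plus a candidate hypothesis entails the observations, and the smallest successful hypothesis set is preferred. The clause content is invented for the medical-student example and is not meant as a serious diagnostic theory.

```python
from itertools import combinations

def closure(facts, rules):
    """Forward chaining over propositional Horn rules (head, body)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in facts and all(b in facts for b in body):
                facts.add(head)
                changed = True
    return facts

# Prior knowledge: a diagnostic rule the student already has.
prior_rules = [("strep_infection", ["sore_throat", "fever"])]
# What she observes: the session transcript (symptoms) and the prescription.
evidence = {"sore_throat", "fever"}
observations = {"prescribed_antibiotic_X"}

# Candidate rules the learner may hypothesise (a tiny, invented hypothesis space).
candidates = [("prescribed_antibiotic_X", ["strep_infection"]),
              ("prescribed_antibiotic_X", ["fever"]),
              ("strep_infection", ["prescribed_antibiotic_X"])]

# Prefer the simplest hypothesis set H with Prior + Evidence + H |= Observations.
for size in range(len(candidates) + 1):
    found = [H for H in combinations(candidates, size)
             if observations <= closure(evidence, prior_rules + list(H))]
    if found:
        print(found[0])   # (('prescribed_antibiotic_X', ['strep_infection']),)
        break
```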

As with KFIL, constraints can be added to the program to prefer simple, explanatory hypotheses and so on. In this scheme, prior knowledge thus plays two roles in reducing the complexity of learning.

1. Since any hypothesis generated must be consistent with the prior knowledge as well as with the new observations, the effective hypothesis space is reduced to include only those theories that are consistent with what is already known.

2. For any given set of observations, the new theory that has to be induced will be much shorter, since the prior knowledge will be available to help out the new rules in explaining the observations. This means that a learning program that is searching the hypothesis space simplest-first will have its task greatly simplified.

3.3. Techniques for theory generation

In this section I will briefly describe two computational approaches to inductive theory formation. The first is a general-purpose, nearly complete method for generating solutions to KBIL learning problems.


The second combines KBDL with inductive methods in an attempt to reduce the computational complexity of learning.

Inverse resolution systems

If the theories and observations under consideration are expressible in first-order logic, then Gödel's Completeness Theorem assures us that, given a correct hypothesis New Knowledge, a mechanical method can demonstrate that it satisfies equation 3. Such a method, called resolution, was developed by Robinson [1965]. A resolution proof consists of a sequence of applications of the resolution inference rule between pairs of sentences, generating a new sentence each time. The key to solving entailment equations such as 3 is to note that the resolution inference rule can be inverted; that is, a mechanical procedure can generate all pairs of sentences S1, S2, such that the result of resolving S1 and S2 would be the given sentence S. The technique of inverse resolution was developed independently by both Russell [1986] and Muggleton and Buntine [1988]. The latter paper describes an implemented system for KBIL. A full inverse resolution system is capable of generating any first-order-expressible theory that explains a given set of data. Muggleton [1988] has applied an information-theoretic definition of simplicity 7 to produce a system that minimizes the size of the representation of the theory plus the unexplained observations, with results in close agreement with human inductive judgment.
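The core of one inverse resolution step can be sketched propositionally: given a clause (here an observation) and one known parent, reconstruct candidate second parents whose resolution with the parent would yield the clause. Real inverse resolution systems work with first-order clauses and inverse substitutions; the example below is only a miniature, with invented literals.

```python
def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2, lit):
    """Propositional resolution: c1 contains lit, c2 contains ~lit."""
    return (c1 - {lit}) | (c2 - {negate(lit)})

def inverse_resolve(resolvent, parent):
    """Given a clause C and one parent C1, construct candidate clauses C2 such that
    resolving C1 with C2 gives back C -- the inverted inference step."""
    candidates = set()
    for lit in parent:
        if parent - {lit} <= resolvent:                         # lit was the literal resolved away
            body = (resolvent - (parent - {lit})) | {negate(lit)}
            candidates.add(frozenset(body))                     # the most specific choice of C2
    return candidates

# Observation: the agent was bitten at time t. Known fact: an enemy was adjacent at t.
observation = frozenset({"Bitten_t"})
known_fact  = frozenset({"AdjacentEnemy_t"})

for c2 in inverse_resolve(observation, known_fact):
    print(sorted(c2))                   # ['Bitten_t', '~AdjacentEnemy_t']
    # i.e. the induced rule AdjacentEnemy_t -> Bitten_t, which together with the
    # known fact entails the observation, as equation (3) requires.
    assert resolve(known_fact, c2, "AdjacentEnemy_t") == observation
```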

Unfortunately, since the problem of checking for consistency between the new theory and the prior knowledge is not decidable in general, it is not always possible to tell when the system has generated the shortest consistent, explanatory theory. However, by setting a time limit on attempts to prove inconsistency one can approximate a complete inductive system.

Some philosophers distinguish between prescientific methods of learning, such as the detection of correlations between observables, and scientific methods, which develop explanatory theories that posit new objects and properties to explain observed regularities [Newton-Smith, 1981]. Perhaps surprisingly, the ability to generate new objects and predicates falls directly out of the inverse resolution rule. Such constructive induction will occur whenever the creation of new terms allows the agent to simplify its overall theory.


Even extremely simple worlds can lead to the creation of symbols that can be construed as referring to external objects. Consider the learning problem described in figure 2. The actions of the robot are the null action and ninety-degree turns right and left, and its sensory equipment reports only the colour directly ahead. The left column gives the sequence of visual inputs and actions, the right column shows a simple logical theory to explain them. 8 The theory involves three new predicates P, Q and R, which can be construed as "has colour", "is to the left of" and "I am facing", and four new objects G1 through G4, which represent the walls.

Visual Input(t1, Red)     Action(t1, Left)
Visual Input(t2, Pink)    Action(t2, Right)
Visual Input(t3, Red)     Action(t3, Right)
Visual Input(t4, Red)     Action(t4, Right)
Visual Input(t5, Blue)    Action(t5, Null)

P(G1, Red)   P(G2, Red)   P(G3, Blue)   P(G4, Pink)
Q(G1, G2)    Q(G2, G3)    Q(G3, G4)     Q(G4, G1)
R(t1, G1)
∀n,g,c: R(tn, g) ∧ P(g, c) ⇒ Visual Input(tn, c)
∀n,g,h: R(tn, g) ∧ Q(g, h) ∧ Action(tn, Right) ⇒ R(tn+1, h)
∀n,g,h: R(tn, g) ∧ Q(h, g) ∧ Action(tn, Left) ⇒ R(tn+1, h)
∀n,g: R(tn, g) ∧ Action(tn, Null) ⇒ R(tn+1, g)

Figure 2: Learning from primitive sensory input

Another typical case arises whenever the new term corresponds to a property of the world that can arise in several different ways and have several significant consequences. In the RALPH world one of the first things an agent learns is the At Food predicate, which eventually holds whenever the agent is in the same square as a food object. The predicate is created when the agent observes that its utility increases after an Eat action only if the food smell input is currently high, or if it previously executed a Move Forward when the central part of its visual field was bright green or yellow. Under the same conditions, the agent also observes that adjacent enemies no longer cause utility loss (actually they eat the food instead). The At Food predicate is generated directly by the inverse resolution process in two new hypotheses: one gives a set of sufficient sensory preconditions to be At Food; the other proposes At Food as a sufficient condition for the consequences noted. Note that the predicate is not defined in terms of observational primitives, but is constrained by them as to its possible meaning.

Such a program is capable of some more stunning discoveries, in principle. Consider the series of sensory inputs that a RALPH agent receives as it moves through its environment. Before any theory is available, each part of each input is a 'surprise': the agent cannot explain why some parts of its visual field remain constant as time passes, and others change. The most compact theory to explain the agent's observations is probably a two-dimensional Euclidean model with an absolute frame of reference (since most objects in the world stay put). This theory can be generated from the observational inputs by a series of applications of the inverse resolution rule. It is, however, unlikely that a discovery of this difficulty can be made in one go, starting from scratch, since the search space is far too large. The system must first generate theories of intermediate strength that nonetheless serve to simplify the overall representation of the data. A series of theories might start with the notion of external objects with fixed properties, as described above, followed by creation of the left-right distinction to explain the apparent motion of objects to the extremes of the visual field as the agent moves forward, and the predicate nearer to explain differences in apparent size and its rate of change. Only thorough investigation will reveal whether a sequence of sufficiently small steps exists to make the discovery process tractable.

Declarative bias systems

Any simple search strategy used in an inverse resolution system is bound to fail when it comes to dealing with large bodies of prior knowledge, since it will not be immediately clear which parts should be combined with the observations in the proof process. Intuitively, however, it seems that when an agent has more knowledge it should be easier to produce a small set of plausible hypotheses prior to examining the observations. In the earlier historical discussion of concept learning programs, the programmer's knowledge of a suitable hypothesis space was seen to be crucial to success. Bias is the term used to describe the predisposition of an agent to consider only a restricted set of theories as possible explanations for a set of observations. In a declarative bias system, this predisposition is identified with a belief in an explicit sentence describing the world -- effectively the belief that one of the restricted set of theories is correct (this is identical with the implied premise of the Valiant model). If an agent can use its own prior knowledge to derive such a belief, then it has a way to construct a nice, small hypothesis space autonomously, without help from the programmer. Once constructed, the hypothesis space can be searched for simple solutions to the KBIL entailment equation (3), secure in the knowledge that all the hypotheses are guaranteed consistent with the system's prior knowledge. In this paper we can only sketch a brief outline of the technique. A more thorough exposition and a system design appear in [Russell, 1989].

Obviously, a belief that the true theory is one of a given set of hypotheses can be expressed as a disjunction of those hypotheses. In any interesting case, this disjunction will be far too large to construct, and is often infinite. The key step in developing a declarative bias system is the observation that the disjunction (or a weaker version of it) can be compactly represented using determinations. For example, the determination P1 ∧ P2 ∧ P3 ∧ P4 ∧ P5 > Q is logically equivalent to the disjunction of roughly four billion hypotheses as to the correct definition of the concept Q that are expressible in terms of the predicates P1 through P5. The first stage in learning a concept is therefore the derivation from prior knowledge of the strongest possible determination for the concept to be learned. This determination in effect identifies a minimal set of relevant features of the observations. For example, a RALPH agent with a background theory that ruled out action-at-a-distance would be able to conclude that the contents of the adjacent squares, and the agent's own action, determine the change in utility in the current time step. This phase corresponds to the process of establishing relevant experimental variables that precedes any scientific investigation. The weaker the agent's prior knowledge, the more variables will have to be considered potentially relevant.
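A schematic example of this first stage, with invented feature names and with the deductive derivation of the bias replaced by a crude lookup, is given below; the point is only that the determination shrinks the hypothesis space before any observations are examined.

```python
from itertools import product

# All observable features of a RALPH time step (illustrative names only).
ALL_FEATURES = ["enemy_N", "enemy_S", "enemy_E", "enemy_W",
                "enemy_far_away", "wall_colour_ahead", "own_action"]

def relevant_features(prior_knowledge):
    """Stand-in for the deductive derivation of a determination from prior knowledge:
    'no action at a distance' licenses  enemy_N ^ enemy_S ^ enemy_E ^ enemy_W ^ own_action > utility_change,
    so only those features need be considered."""
    if "no_action_at_a_distance" in prior_knowledge:
        return [f for f in ALL_FEATURES
                if f.startswith("enemy_") and not f.endswith("far_away")] + ["own_action"]
    return ALL_FEATURES

relevant = relevant_features({"no_action_at_a_distance"})

# Hypothesis space licensed by the determination: conjunctions over the relevant
# features only (each one required, forbidden, or ignored).
restricted = list(product((True, False, None), repeat=len(relevant)))
unrestricted = 3 ** len(ALL_FEATURES)
print(len(relevant), len(restricted), unrestricted)   # 5 243 2187
```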

Page 18: Inductive learning by machines

54 STUART RUSSELL

Once the relevant properties are established, additional knowledge can be brought to bear to restrict the space of possible theories. For example, the agent might know that emptiness of an adjacent square can only be negatively related, if at all, to the possibility of an abrupt utility change. The result of this process is an "annotated skeleton", giving the structure of the theory and some constraints on how it might be fleshed out. The final step is to generate actual hypotheses using the experimental observations in an inverse resolution process. The annotated skeleton also makes it possible to design crucial experiments to resolve ambiguities in the theory structure. Theoretical results show that these methods sharply reduce the complexity of the theory formation task [Russell, 1988].

One can summarize the state of the art in KBIL as follows. The entailment equation (3), together with some notion of simplicity, forms the specification of the learning problem. The method of inverse resolution provides a data-driven and partially theory-driven generating technique capable of finding arbitrary solutions to the entailment equation, including solutions involving novel entities and concepts. The declarative bias approach provides a deductive technique for directly constructing a small hypothesis space, all of whose elements are consistent with the prior knowledge; the search for inductive solutions is therefore simplified.

4. BURNING QUESTIONS AND RED HERRINGS

In one short paper one cannot hope to examine in depth all the philosophical questions that a mechanistic theory of theory formation might engender. Instead, I shall make brief and objectionable comments on some of the more obvious points of contention.

4.1. Social Scientists?

It seems to be currently trendy in some fields of study to propose that many phenomena -- beliefs, theories, memory, to name but a few -- can arise only in the context of social interaction. Vygotsky, in the field of developmental psychology, seems to have attracted many to this way of thinking. Quine and Ullian seem to adopt it also with respect to science.


While there are no doubt rich veins of sociological gold to be dug in the study of the institutions and practice of science, it seems obvious that one can study the pure philosophical questions by consideration of an isolated agent trying to survive and prosper in a complex environment. No computational theory of learning has been proposed that would benefit from consideration of multiple agents. There are, however, some obscurities arising from the difficulty of defining an atheoretical intersubjective language in which scientists can discuss their theories, but this seems to be a question for the philosophy of language, not for epistemology.

4.2. Stable Observation Language

A fundamental objection to the empiricist programme in recent decades has been the rejection of the notion of a stable observation language (SOL) [Hesse, 1980, Kuhn, 1970], and with it the distinction between observational and theoretical terms [Hesse, 1980, Newton-Smith, 1981]. Further catastrophes result when it is proposed that different theories are therefore necessarily incommensurable [Feyerabend, 1970, Kuhn, 1970], since the meaning of observational predicates is dependent on the theory one holds. The usual technique is to point to a supposedly primitive observation sentence such as "the block weighs 5kg", and discuss the various ways in which such a sentence presumes individuation of objects, constancy of mass, the possibility of fixed mass units, and so on. The classical positivist doctrine, which derived the empirical (although not deductive) foundation of scientific theories from their connections to irrefutable observations, seems to have been broadly rejected.

In the RALPH-world agent described above, and in any computer system attached to an environment, real or simulated, it is relatively easy to point to a stable observation language -- namely, the propositional language corresponding to the presence or absence of stimuli in each of the cells of the system's sensory array. This language corresponds exactly to Quine's [1960] definition of observation sentences as those whose meaning can be defined adequately by reference to the set of external sensory situations that would cause assent. Of course, the language is at a much lower level than the kind of language in which scientists exchange observations; but it has the desirable property, from a positivist's point of view, of being atheoretical. The agent doesn't even have to know what a sentence such as P9(t3) ("the ninth cell of the visual field recorded a stimulus at the third time step") means in terms of properties of the external world, much less be able to communicate its meaning to another agent. 9 Nonetheless, the sentence does of course have empirical content, and is not vulnerable to the attack on the meaninglessness of "sense-impression talk" that bedevilled phenomenology. Furthermore, as Quine observed, such sentences are in a useful sense infallible. From them, using an inductive method such as inverse resolution, we can build up to a more theoretical level, where sentences might include reference to external objects and predicates applying to them. As mentioned above, the theoretical terms (that is, any terms other than the sensory primitives) will not be defined in terms of observational predicates by 'meaning postulates', but will be logically constrained by sentences involving them. Thus what happens in the computer program resembles, in some ways, the 'holistic' conception of the meaning of theoretical terms in later positivist theories, or in Quine's network model.

It is possible, in fact, to push the level of observation sentences further up, perhaps even as far as predications about objects. For the same arguments I have made for the stability of primitive sensory sentences apply also to sentences derived from the primitive sentences, provided those sentences make no synchronic or diachronic claims about other observational sentences. This provides a recursive definition of the term 'observational sentence'. In essence, the role of the operations used to derive the new observational sentences is to provide the agent with a new and perhaps better sense organ, which can in fact be augmented over time. The new observational sentences it provides are still infallible, since they make no predictions that can be compared with subsequent sensory inputs.

What sense, then, are we to make of Popper's [1968b] claim that

Sense data, untheoretical items of information, simply do not exist. For we always operate with theories, some of which are even incorporated in our physiology. And a sense organ is akin to a theory ... we develop inside our skin.

Whatever brilliance went into designing a sense organ, its information is still presented as an uninterpreted primitive predicate; even if the theory used by the designer is totally false, the sense organ is not therefore lying. It is up to the agent to find whatever correlations it can among its sense organs' observation sentences. In a sense other than that intended, however, Popper's observation is extremely perceptive. For the sense organ does incorporate knowledge of the world in its design, because it attempts to provide observation sentences in terms of which useful correlations can easily be found. If the sense organ is poorly designed, it will be harder for the agent to find good theories quickly. For example, if the sense organ is responsible for object individuation and sometimes segments images improperly, it will be hard for the agent to find any reliable theories relating the object predications with which it is presented.

I have argued, then, that the firm empirical foundation once provided by observational sentences can be rebuilt. What of the notion of incommensurability? More or less the same arguments apply. It is perfectly possible for our induction system to posit two distinct theories for comparison. The theories must be comparable at the level of sensory primitives; in fact, they can be comparable well above the level of observational sentences provided they agree on all the theoretical transitions involved. The less two theories share, the harder it will be to find tests that discriminate between them, but it will always be possible.

4.3. Knowledge? What Is Knowledge?

I have argued for the crucial role of accumulated knowledge in allowing significant inductive learning to take place. One could correctly reply that an incremental story in which an agent uses theories gleaned from observations 1 ... n to interpret and learn from observations n + 1 ... n + k is no more persuasive an account of learning than the batch method in which the agent simply calculates the best theory to account for all the observations 1 ... n + k. This underlies a recent retort from the statistician V. Vapnik: "Knowledge? What is knowledge? All I know is samples!" In general it will not be the case that the incremental method can be made to yield the same results as the more "correct" batch method. The accumulation and use of knowledge is best regarded as a heuristic device, whose nature can be analyzed as follows.

When new observations are interpreted with respect to prior knowledge, treated as certain, they fall into three categories:

1. already accounted for by prior knowledge
2. inconsistent with prior knowledge
3. consistent with, but not predicted by, prior knowledge.

Clearly, case 3 can only arise with incomplete theories. If the prior theory is complete (roughly, it makes a concrete prediction in every possible state), then there is no possibility of its revision unless new observations contradict it; thus for complete theories there is no behavioural distinction between batch and incremental methods unless some form of locality is enforced when considering possible revisions. In the case of incomplete theories, there are two options: to consider revisions (rather than just additions) to past theories when unexplained observations arise (the revolutionary policy), or to revise past theories only when new observations are inconsistent (the conservative policy).

I will posit, for the sake of argument, that in the case of incomplete theories the concept of knowledge, as distinct from hypothesis, can be identified with the latter policy. The policy allows one to ignore all past data except when inconsistencies arise (in which case there is no alternative but to re-examine them). The policy can be modulated by the number of new observations that are allowed to accumulate before an explanation is sought; by distinguishing a subset of the prior theory that may be subject to revision without inconsistency; and by varying the 'entry requirements' on new explanatory hypotheses before they are considered as knowledge.
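A skeletal rendering of the conservative policy, with the three categories above made explicit, might look as follows; the Theory and Observation structures and the crude reviser are placeholders of my own, not part of any of the systems discussed.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    inputs: tuple
    outcome: bool

@dataclass
class Theory:
    rules: dict = field(default_factory=dict)    # inputs -> predicted outcome
    pending: list = field(default_factory=list)  # unexplained but consistent observations

    def predict(self, inputs):
        return self.rules.get(inputs)            # None = the theory is silent (incomplete)

def classify(theory, obs):
    """The three categories of the text: explained, inconsistent, or merely consistent."""
    prediction = theory.predict(obs.inputs)
    if prediction is None:
        return "unexplained"
    return "explained" if prediction == obs.outcome else "inconsistent"

def conservative_update(theory, obs, revise):
    """Conservative policy: treat the current theory as knowledge and leave it alone
    unless an observation contradicts it; only then pay the cost of revision."""
    if classify(theory, obs) == "inconsistent":
        return revise(theory, obs)               # the expensive, non-local step
    theory.pending.append(obs)                   # file it away; explain in bulk later
    return theory

# Toy usage, with a trivially crude reviser that just overwrites the offending rule.
def crude_revise(theory, obs):
    theory.rules[obs.inputs] = obs.outcome
    return theory

t = Theory(rules={("enemy_adjacent",): True})
t = conservative_update(t, Observation(("enemy_adjacent",), True), crude_revise)   # explained
t = conservative_update(t, Observation(("no_enemy",), False), crude_revise)        # unexplained
t = conservative_update(t, Observation(("enemy_adjacent",), False), crude_revise)  # revision forced
print(t.rules, len(t.pending))
```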

There are also possible disadvantages to an incremental policy: if the prior knowledge is incorrect, it will mislead the system in its attempt to understand new observations, leading to more complex overall theories than are strictly necessary. The conservative knowledge policy therefore trades accuracy for efficiency; given the complexity results obtained in learning theory, this is inescapable, and provides a rationale for conservative scientific behaviour. The problem of complexity therefore introduces a kind of inertia. Whether science proceeds in a cumulative or revolutionary fashion depends, to a large extent, on the observations that are made. It may be that early observations lead one onto a foothill, or "local maximum", in the climb towards perfection, such that later observations can only be explained by abandoning the current theory altogether. The ensuing revolution -- a search for a new route in the inverse resolution process if you like -- can involve enormous computational costs, requiring sophisticated heuristics to increase the likelihood of success.

4.4. Realism and the Transitoriness of Existence

There seem to be two principal elements to the realist position: first, that theories can be true independently of the theorist; second, that "evidence for the truth [of a theory] is evidence for the existence of whatever has to exist for the theory to be true" [Newton-Smith, 1981]. From the viewpoint of machine learning, the first point is unexceptionable but not particularly revealing. The second is somewhat more puzzling, since it is so hard to operationalize what might be meant by a claim of 'existence'. Presumably, a paradigm case of existence is exhibited by the Eiffel Tower. But consider Einstein's position on objecthood, which is so consonant with the constructive induction method given above that it's worth quoting at length:

Out of the multitude of our sense experience we take, mentally and arbitrarily, certain repeatedly occurring complexes of sense impressions ... and correlate to them a concept -- the concept of a bodily object. Considered logically this concept is not identical with the totality of sense impressions referred to; but it is a free creation of the human (animal) mind. On the other hand, this concept owes its meaning and its justification exclusively to the totality of the sense impressions we associate with it. The second step is to be found in the fact that in our thinking (which determines our expectations), we attribute to this concept of a bodily object a significance which is to a high degree independent of the sense impressions which originally gave rise to it. This is what we mean when we attribute to the bodily object a 'real existence'. [Einstein, 1936]

Put simply, the positing of the existence of an object is a way to inductively collapse a collection of observations into a simple sentence. Consider the Middle Ages. Their existence as a distinct object is identical in kind, if not in degree, to that of the Eiffel Tower. But if we were to discover additional facts about neighbouring centuries or even about the Middle Ages themselves, a different individuation of history might yield a simpler theory, so to speak, that no longer included the Middle Ages in its ontology. It is unlikely, but possible, that the same thing might happen to the Eiffel Tower. However, the portions of reality to which these terms refer would remain blissfully unaffected. More radically still, two equally simple axiomatizations of the same theory may contain entirely different ontologies, but the choice between them might well be made on entirely evanescent criteria (such as which one was thought of first).

Clearly, existence or the lack of it is not the serious commitment that realists propose. The problem may be a psychological one, in that the process of object individuation responsible for our ontological inclusion of the Eiffel Tower seems to be very deeply rooted. Theoretical objects such as electrons owe their existence to the same process, more or less, but rather than viewing the positing of the existence of electrons as a serious claim on reality, we should instead realize the ephemeral nature of the existence of the Eiffel Tower. After all, a serious field theory has no objects at all in its ontology for the entire universe, but its proponent does not, and should not, suffer a deep ontological crisis. Similarly, one might argue that proponents of cumulative models of science should not be dismayed by the apparently wholesale ontological slaughter accompanying scientific revolutions, since what is, or might be, accumulated is the set of correct observational consequences.

5. COMMON GROUNDS FOR FURTHER WORK

Suppes [1968], considering the task of a person trying to build a robot that could live and learn in the real world, came to the following conclusion:

As an evaluation of the depth of theory either in philosophy or psychology, we may ask what does either discipline have to offer such a man [sic], and the answer, I am afraid, is still pretty starkly negative.

It is indeed reasonable for such a person to ask a philosopher of science, particularly one claiming to be a rationalist, for guidance in the design of an inductive learning engine. It is not clear that the answers now would be any better than at the time Suppes made the statement, particularly as confirmation theory has gone out of fashion. However, he was overly pessimistic. Clearly, philosophy has given us a language, first-order logic, that can serve to represent theories inside a machine; and logicians at least proved that a complete semi-decision procedure exists, even if they didn't provide it. The KBIL model underlies many discussions in the hypothetico-deductive tradition, and many useful philosophical contributions have been made concerning simplicity. Furthermore, Goodman's theory could have provided a basis from which to develop KBDL systems, had it been used.

A first area in need of philosophical sophistication is the validation of machine learning programs. At present there is no good criterion for judging the success of a program, since any concrete domain of application may have properties that just happen to suit the arbitrary aspects of the program design. The notion of success at inducing "a randomly generated theory" is equally problematic. More generally, work is needed to create and justify formal models of learning, such as Valiant's, that embody interesting and attainable criteria of success.
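
For concreteness, the success criterion in Valiant's model can be stated roughly as follows (a standard paraphrase of [Valiant, 1984], not a quotation): given any error bound e > 0 and confidence parameter d > 0, with examples drawn from a fixed but unknown distribution, the learner must, after seeing a number of examples polynomial in 1/e, 1/d and the size of the target concept, output a hypothesis h such that

    Pr[ error(h) <= e ] >= 1 - d,

where error(h) is the probability that h misclassifies a further example drawn from the same distribution. The criterion is attainable for restricted hypothesis languages but, notably, says nothing about which domains deserve to be learned.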

There are several areas in which machine learning could benefit from the experience of philosophers in formalizing the content of theories. In the KBDL model, the generation of hypotheses and the selection and design of experiments involve the application of weak prior theories. By examining actual examples, the process can be elucidated. Furthermore, the representation problem for weak theories might be addressed, since they do not seem to be easily expressed in standard logics. A related investigation might examine a sequence of possible theories of increasing strength, involving the ultimate creation of significant new concepts, to see what intermediate notions serve as stepping stones, and to see if the theories (combined with the observations expressed with respect to the theories) monotonically decrease in complexity. It may be the case that a decreasing series, with steps of bounded size, can always be found for any final theory, which would have significant consequences.
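
One concrete reading of this proposal, sketched below under the assumption that complexity is measured by description length in bits (an assumption of the sketch, not a commitment of the text), is to compute for each stage the length of the theory plus the length of the observations encoded with respect to that theory, and to check that the totals decrease with no step larger than a given bound.

    # A sketch of the monotone-decrease test suggested above.  It assumes the
    # complexity of a stage is the description length (in bits) of the theory
    # plus that of the observations re-encoded in the theory's vocabulary;
    # the bit counts below are placeholders, not real encodings.

    def total_complexity(theory_bits, observation_bits):
        """Bits for the theory plus bits for the observations given the theory."""
        return theory_bits + observation_bits

    def decreasing_with_bounded_steps(stages, max_step):
        """stages: (theory_bits, observation_bits) pairs ordered by increasing
        theoretical strength.  True if the totals never increase and no single
        decrease exceeds max_step."""
        totals = [total_complexity(t, o) for t, o in stages]
        return all(0 <= prev - cur <= max_step
                   for prev, cur in zip(totals, totals[1:]))

    # Hypothetical stages: stronger theories cost more to state but compress
    # the observations more, so the totals shrink at each step.
    stages = [(10, 500), (40, 300), (90, 180), (150, 100)]
    print(decreasing_with_bounded_steps(stages, max_step=200))   # True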

If we assume that the extensive use of prior knowledge by an incremental learning strategy can effect a solution to the problem of complexity, then we are left with some interesting technical questions: 1) Are there languages for which the conservative knowledge policy can be shown not to lose accuracy compared to the full-revision approach? 2) What constraints must be imposed on the universe in order for the conservative knowledge policy to 'succeed' -- that is, lose only slightly in accuracy while gaining significantly in efficiency? Intuitively, the answer to the second question is that the universe must be significantly compressible, i.e. non-random. It remains to find formal definitions for 'slightly' and 'significantly'. However, if this intuition turns out to be correct, then some form of incremental, knowledge-based strategy must be the only possible approach -- since if the universe is not significantly compressible then no learning method is going to be tractable (or, for that matter, particularly useful). In other words, we may as well assume the game is worth playing.
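
One way the two terms might be made precise, offered only as a sketch and not as the intended definitions: let acc_C(n) and acc_F(n) be the expected predictive accuracies of the conservative and full-revision policies after n observations, and cost_C(n) and cost_F(n) their computational costs. The conservative policy then 'succeeds' if there exist a small e > 0 and a large factor k such that, for all sufficiently large n,

    acc_F(n) - acc_C(n) <= e   and   cost_F(n) / cost_C(n) >= k.

On this reading, the second question asks what structural constraints on the universe (some formal notion of compressibility) guarantee that such e and k exist.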

NOTES

* The research described herein has been supported by funding from the Lockheed AI Center and the California MICRO Program. The author would also like to thank Benjamin Grosof, John Pollock, Devika Subramanian, Thomas Dietterich and Steven Muggleton for their valuable comments and suggestions.
1 Of course, these buttons and lights are not some kind of phenomenologist's Vorhang, but are an integral, causally-connected part of the agent; "input and output wires" might be a better metaphor. There is no individuation of objects in the visual field, and the inputs can be seen as atheoretical sense data.
2 This presents the agent with the hidden variable dilemma familiar from the debate over quantum theory.
3 Of course, in this case the desired concept is disjunctive and so a conjunctive restriction will result in a failure to find a consistent concept definition.
4 This relationship is of course a simplification of the true connections among observations, predictions and explanations. In this form, it applies only to cases where the instances are described by term structures to which the predicate of interest is applied (for example, integer(successor(0))). In the general case, descriptions and 'outcomes' are both represented as predications on individuals. Thus only the observation conditional description ⊃ outcome is entailed by the theory. These issues are dealt with at length in the philosophical literature [Pollock, 1990]. The intent here is only to indicate the general distinctions among the various categories of learning problems.
5 Apparently, some tourists insist on referring to all Germans as Fritz. The origin of this misconception is obscure.
6 The determination P(x, y) ≻ Q(x, z) has the first-order representation ∀wxyz [P(w, y) ∧ P(x, y) ∧ Q(w, z) → Q(x, z)].
7 The highly problematic notion of simplicity has not been dealt with in this paper. The debate continues as to whether preference for simplicity can be justified [Pearl, 1978; Kemeny, 1953]. The problem of representation-dependence of simplicity, and consequent arbitrariness of theory choice, has been addressed by the notion of Kolmogorov complexity [Solomonoff, 1964; Kolmogorov, 1965], which has given rise to a large subfield [Li and Vitanyi, 1989].
8 We have assumed that the robot is aware of the ordering of its visual inputs, hence the use of integer indices on the time steps. Also note that if all four walls are different colours, or if the mdl action is not available, then there is no need to distinguish between the sensory stimulus and the external object that causes it.
9 Pollock [1986] has argued that such sentences do not really count as beliefs and thus cannot ground scientific theories. Leaving conscious awareness aside for the moment (in fact, for the rest of the paper), it seems that the ability of a logical sentence to be used for inference and theory formation within the scientist should qualify it as a belief for the purposes of trying to understand how science is possible. A human, or complex knowledge-based system, might form beliefs about the observation sentence, for example that an observation sentence of a particular type is being formed, or that the observation sentence has such and such a meaning in terms of external objects, and these higher-level beliefs are surely fallible. But they are not the grounds from which theory formation begins.

REFERENCES

Blum, L. and Blum, M. 1975, 'Toward a mathematical theory of inductive inference', Information and Control, 28, pp. 125--155.
Buchanan, B. G. and Mitchell, T. M. 1978, 'Model-directed learning of production rules', in Waterman, D. A. and Hayes-Roth, F. (Eds.), Pattern-directed inference systems (Academic Press, New York).
De Jong, G. 1981, 'Generalizations based on explanations', Proceedings of the Seventh International Joint Conference on Artificial Intelligence (Morgan Kaufmann, Vancouver, BC), pp. 67--69.
Dietterich, T. G. 1986, 'Learning at the knowledge level', Machine Learning, 1(3).
Einstein, A. 1936, 'Physics and reality', Journal of the Franklin Institute, 221.
Feyerabend, P. 1970, Against method (New Left Books, London).
Gold, E. M. 1967, 'Language identification in the limit', Information and Control, 10, pp. 447--474.
Goodman, N. 1955, Fact, fiction and forecast (Harvard University Press, Cambridge, MA).
Hesse, M. 1980, Revolutions and reconstructions in the philosophy of science (Indiana University Press, Bloomington).
Kemeny, J. G. 1953, 'The use of simplicity in induction', Philosophical Review, 62, pp. 391--408.
Kolmogorov, A. N. 1965, 'Three approaches to the quantitative definition of information', Problems in Information Transmission, 1(1), pp. 1--7.
Kuhn, T. S. 1970, The structure of scientific revolutions (Chicago University Press, Chicago, IL).
Lakatos, I. and Musgrave, A. (Eds.) 1968, Problems in the philosophy of science (North Holland, Amsterdam).
Li, M. and Vitanyi, P. M. B. 1989, An introduction to Kolmogorov complexity and its applications (ACM Press, New York).
Miller, D. 1974, 'Popper's qualitative theory of verisimilitude', British Journal for the Philosophy of Science, 25, pp. 178--88.
Mitchell, T. M. 1982, 'Generalization as search', Artificial Intelligence, 18(2), pp. 203--226.
Mitchell, T. M., Keller, R. M. and Kedar-Cabelli, S. T. 1986, 'Explanation-based generalization: A unifying view', Machine Learning, 1, pp. 47--80.
Muggleton, S. H. 1988, 'A strategy for constructing new predicates in first-order logic', Proceedings of the Third European Working Session on Learning (Pitman, Glasgow, Scotland), pp. 123--130.
Muggleton, S. H. and Buntine, W. 1988, 'Machine invention of first-order predicates by inverting resolution', Proceedings of the Fifth International Machine Learning Conference (Morgan Kaufmann, Ann Arbor, Michigan).
Newton-Smith, W. H. 1981, The rationality of science (Routledge and Kegan Paul, London).
Pearl, J. 1978, 'On the connection between the complexity and credibility of inferred models', International Journal of General Systems, 4, pp. 255--64.
Pollock, J. L. 1986, Contemporary theories of knowledge (Rowman and Littlefield, Totowa, NJ).
Pollock, J. L. 1990, Nomic probability and the foundations of induction (Oxford University Press).
Popper, K. R. 1968a, The logic of scientific discovery (Hutchinson, London).
Popper, K. R. 1968b, 'Is there an epistemological problem of perception?', in (Lakatos & Musgrave, 1968), pp. 163--4.
Quine, W. v. O. 1960, Word and object (MIT Press, Cambridge, MA).
Quine, W. v. O. and Ullian, J. S. 1970, The web of belief (Random House, New York).
Rivest, R. L. and Schapire, R. E. 1987, 'A new approach to unsupervised learning in deterministic environments', Proceedings of the Fourth International Workshop on Machine Learning (Morgan Kaufmann, Irvine, CA).
Robinson, J. A. 1965, 'A machine-oriented logic based on the resolution principle', Journal of the ACM, 12(1), pp. 23--40.
Russell, S. J. 1986, 'Preliminary steps toward the automation of induction', Proceedings of the Fifth National Conference on Artificial Intelligence (Morgan Kaufmann, Philadelphia, PA).
Russell, S. J. 1988, 'Tree-structured bias', Proceedings of the Seventh National Conference on Artificial Intelligence (Morgan Kaufmann, Minneapolis, MN).
Russell, S. J. 1989, The use of knowledge in analogy and induction (Pitman, London).
Solomonoff, R. J. 1964, 'A formal theory of inductive inference (parts I and II)', Information and Control, 7, pp. 1--22, 224--54.
Suppes, P. 1968, 'Information processing and choice behaviour', in (Lakatos & Musgrave, 1968), pp. 278--99.
Tichy, P. 1974, 'On Popper's definition of verisimilitude', British Journal for the Philosophy of Science, 25, pp. 155--60.
Valiant, L. G. 1984, 'A theory of the learnable', Communications of the ACM, 27, pp. 1134--1142.
Warmuth, M. (Ed.) 1989, Proceedings of the Second International Workshop on Computational Learning Theory (Morgan Kaufmann, Santa Cruz, CA).

Computer Science Division University of California Berkeley, CA 94720 USA