Does Machine Learning Really Work?

Tom M. Mitchell

AI Magazine Volume 18 Number 3 (1997) (© AAAI)

All new material in this article is copyright © 1997, American Association for Artificial Intelligence. All rights reserved. Portions of this article first appeared in Machine Learning by Tom M. Mitchell, © 1997 McGraw Hill Book Company, and have been reprinted here with permission from the publisher.

Does machine learning really work? Yes. Over the past decade, machine learning has evolved from a field of laboratory demonstrations to a field of significant commercial value. Machine-learning algorithms have now learned to detect credit card fraud by mining data on past transactions, learned to steer vehicles driving autonomously on public highways at 70 miles an hour, and learned the reading interests of many individuals to assemble personally customized electronic newspapers. A new computational theory of learning is beginning to shed light on fundamental issues, such as the trade-off among the number of training examples available, the number of hypotheses considered, and the likely accuracy of the learned hypothesis. Newer research is beginning to explore issues such as long-term learning of new representations, the integration of Bayesian inference and induction, and life-long cumulative learning. This article, based on the keynote talk presented at the Thirteenth National Conference on Artificial Intelligence, samples a number of recent accomplishments in machine learning and looks at where the field might be headed.

Ever since computers were invented, it has been natural to wonder whether they might be made to learn. Imagine computers learning from medical records to discover emerging trends in the spread and treatment of new diseases, houses learning from experience to optimize energy costs based on the particular usage patterns of their occupants, or personal software assistants learning the evolving interests of their users to highlight especially relevant stories from the online morning newspaper. A successful understanding of how to make computers learn would open up many new uses for computers. And a detailed understanding of information-processing algorithms for machine learning might lead to a better understanding of human learning abilities (and disabilities) as well.

Although we do not yet know how to make computers learn nearly as well as people learn, in recent years many successful machine-learning applications have been developed, ranging from data-mining programs that learn to detect fraudulent credit card transactions, to information-filtering systems that learn users' reading preferences, to autonomous vehicles that learn to drive on public highways. Along with this growing list of applications, there have been important advances in the theory and algorithms that form the foundations of this field.

The Niche for Machine Learning

One way to understand machine learning is to consider its role within the broader field of computer science. Given the broad range of software paradigms within computer science, what is the niche to be filled by software that learns from experience? Obviously, for computer applications such as matrix multiplication, machine learning is neither necessary nor desirable. However, for other types of problems, machine-learning methods are already emerging as the software development method of choice. In particular, machine learning is beginning to play an essential role within the following three niches in the software world: (1) data mining, (2) difficult-to-program applications, and (3) customized software applications.

Data mining: Data mining is the process of using historical databases to improve subsequent decision making. For example, banks now analyze historical data to help decide which future loan applicants are likely to be creditworthy. Hospitals now analyze historical data to help decide which new patients are likely to respond best to which treatments. As the volume of online data grows each year, this niche for machine learning is bound to grow in importance. See www.kdnuggets.com/ for a variety of information on this topic.


Difficult-to-program applications: Machine-learning algorithms can play an essential role in applications that have proven too difficult for traditional manual programming, such as face recognition and speech understanding. The most accurate current programs for face recognition, for example, were developed using training examples of face images together with machine-learning algorithms. In a variety of applications where complex sensor data must be interpreted, machine-learning algorithms are already the method of choice for developing software.

Customized software applications: In many computer applications, such as online news browsers and personal calendars, one would like a system that automatically customizes to the needs of individual users after it has been fielded. For example, we would like a news browser that customizes to the individual user's reading interests or an online calendar that customizes to the individual user's scheduling preferences. Because it is unrealistic to manually develop a separate system for each user, machine learning offers an attractive option for enabling software to automatically customize itself to individual users.

The following sections present examples of successful applications of machine-learning algorithms within each of these three niches. As in any short summary of a large field, the role of this article is to sample a few of the many research projects and research directions in the field to provide a feel for the current state of the art and current research issues. More lengthy treatments of the current state of machine learning can be found in Dietterich's article (also in this issue) and in Mitchell (1997). Treatments of various subfields of machine learning are also available for topics such as statistical learning methods (Bishop 1996), neural networks (Chauvin and Rumelhart 1995; Hecht-Nielsen 1990), instance-based learning (Aha, Kibler, and Albert 1991), genetic algorithms (Mitchell 1996; Forrest 1993), computational learning theory (Kearns and Vazirani 1994), decision tree learning (Quinlan 1993), inductive logic programming (Lavrac and Dzeroski 1994), and reinforcement learning (Kaelbling, Littman, and Moore 1996; Barto, Bradtke, and Singh 1995).


Figure 1. A Typical Data-Mining Application.
A historical set of 9714 medical records describes pregnant women over time. The top portion of the figure illustrates a typical patient record (features such as Age, FirstPregnancy, Anemia, Diabetes, PreviousPrematureBirth, Ultrasound, Elective C-Section, and Emergency C-Section, recorded at each time step), where ? indicates that the feature value is unknown. The task is to identify classes of patients at high risk of receiving an emergency Cesarean section (C-section). The bottom portion of the figure shows one of many rules discovered from these data using Schenley Park Research's RULER system:

If no previous vaginal delivery, and abnormal 2nd-trimester ultrasound, and malpresentation at admission, then the probability of emergency C-section is 0.6. (Training-set accuracy: 26/41 = .63; test-set accuracy: 12/20 = .60.)

Whereas 7 percent of all pregnant women received emergency C-sections, this rule identifies a subclass of high-risk patients, 60 percent of whom received emergency C-sections.


Data Mining: Predicting Medical Outcomes from Historical Data

Data mining involves applying machine-learning methods to historical data to improve future decisions. One prototypical example is illustrated in figure 1. In this example, historical data describing 9714 different pregnant women have been provided, with each woman described by a set of 215 attributes over time. These attributes describe the woman's health history (for example, number of previous pregnancies), measurements taken at various times during pregnancy (for example, ultrasound results), the type of delivery (for example, normal, elective Cesarean section [C-section], emergency C-section), and the final health of the mother and baby.

Given such time-series data, one is often interested in learning to predict features that occur late in the time series based on features that are known earlier. For example, given the pregnancy data, one interesting problem is to predict which future patients are at exceptionally high risk of requiring an emergency C-section. This problem is clinically significant because most physicians agree that if they know in advance that a particular patient is at high risk of an emergency C-section, they will seriously consider an elective C-section instead to improve the expected clinical outcome for the mother and child. The bottom half of figure 1 provides one typical rule learned from these historical data. Whereas the risk of emergency C-section for the entire population of patients in this case was 7 percent, this rule identifies a class of patients for which the risk is 60 percent. The rule was learned by the RULER program developed by Schenley Park Research, Inc. The program used two-thirds of the available data as a training set, holding the remaining one-third as a test set to verify rule accuracy. As summarized in the figure, a total of 41 patients in the training set match the rule conditions, of which 26 received an emergency C-section. A similar accuracy is found over the test set (that is, of the 20 patients in the test set who satisfy the rule preconditions, 12 received emergency C-sections). Thus, this rule successfully identifies a class of high-risk patients, both over the training data and over a separate set of test data.
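
To make the train/test protocol concrete, here is a minimal sketch of scoring one learned rule against held-out patient records. This is not the actual RULER system; the record field names and the rule encoding are hypothetical simplifications of the rule shown in figure 1.

```python
from typing import Callable

# A patient record is a dict of feature name -> value; the field names
# below are hypothetical stand-ins for the attributes in figure 1.
Record = dict

def rule(r: Record) -> bool:
    """Hypothetical encoding of the learned rule from figure 1."""
    return (not r["previous_vaginal_delivery"]
            and r["ultrasound_2nd_trimester"] == "abnormal"
            and r["malpresentation_at_admission"])

def rule_accuracy(records: list[Record], rule: Callable[[Record], bool]) -> float:
    """Among records matching the rule's preconditions, the fraction
    that actually received an emergency C-section."""
    matched = [r for r in records if rule(r)]
    if not matched:
        return 0.0
    hits = sum(1 for r in matched if r["emergency_c_section"])
    return hits / len(matched)

# With the figure's counts, a training set in which 26 of 41 matching
# patients had emergency C-sections gives 26/41 = 0.63; the test set
# gives 12/20 = 0.60.
```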

This medical example illustrates a prototypical use of machine-learning algorithms for data mining. In this example, a set of time-series data was used to learn rules that predict later features in the series from earlier features.

Time-series prediction problems such as this occur in many problem domains and data sets. For example, many banks have time-series data indicating which credit card transactions were later found to be fraudulent. Many businesses have customer databases that indicate which customers were later found to purchase certain items. Universities have data on which students were later found to successfully graduate. Manufacturers have time-series data on which process parameters later produced flawed or optimal products. Figure 2 illustrates a number of time-series prediction problems with the same abstract structure as the pregnancy example. In all these cases, the problem of discovering predictive relations between earlier and later features can be of great value.

What is the current state of the art for such problems? The answer is that we now have a robust set of first-generation machine-learning algorithms that are appropriate for many problems of this form. These include decision tree–learning algorithms such as C4.5 (Quinlan 1993), rule-learning algorithms such as CN2 (Clark and Niblett 1989) and FOIL (Quinlan 1990), and neural network–learning algorithms such as back propagation (Rumelhart and McClelland 1986). Commercial implementations of many of these algorithms are now available. At the same time, researchers are actively pursuing new algorithms that learn more accurately than these first-generation algorithms and that can use types of data that go beyond the current commercial state of the art. To give a typical example of this style of research, Caruana, Baluja, and Mitchell (1995) describe a multitask learning algorithm that achieves higher prediction accuracy than these other methods. The key to their success is that their system is trained to simultaneously predict multiple features rather than just the feature of interest. For example, their system achieves improved accuracy for predicting mortality risk in pneumonia patients by simultaneously predicting various lab results for the patient.
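
Readers who want to try a decision tree learner on tabular data of this kind can do so in a few lines with scikit-learn. The sketch below uses its CART-style DecisionTreeClassifier as a modern stand-in for C4.5 (it is not C4.5 itself), and the feature names and toy data are invented purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy feature matrix: [age, first_pregnancy, diabetes, abnormal_ultrasound]
X = np.array([
    [23, 1, 0, 0],
    [31, 0, 1, 1],
    [35, 0, 0, 1],
    [28, 1, 0, 0],
    [40, 0, 1, 1],
    [22, 1, 0, 1],
])
y = np.array([0, 1, 1, 0, 1, 0])  # 1 = emergency C-section

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=[
    "age", "first_pregnancy", "diabetes", "abnormal_ultrasound"]))
print(tree.predict([[30, 0, 1, 1]]))  # predicted outcome for a new patient
```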

What are good topics for new thesis research in this area? There are many, such as developing algorithms for active experimentation to acquire new data, integrating learning algorithms with interactive data visualization, and learning across multiple databases. Two of my current favorite research topics in this area are (1) learning from mixed-media data and (2) learning using private data plus internet-accessible data.

Learning from mixed-media data: Whereas rule-learning methods such as the one discussed in figure 1 can be applied to data described by numeric features (for example, age) and symbolic features (for example, gender), current medical records include a rich mixture of additional media, such as images (for example, sonogram images, x-rays), signals from other instruments (for example, EKG signals), text (for example, notes made on the patient chart), and even speech (for example, in some hospitals, physicians speak into handheld recorders as they make hospital rounds). Although we have machine-learning methods that can learn over numeric and symbolic data, other methods that can learn over images, and other methods that can aid learning with speech data, we have no principled methods that can take advantage of the union of these types of data. Put briefly, we are in a situation where we already have available a rich, mixed-media set of historical medical data but where we lack the machine-learning methods to take advantage of these data. Research on new algorithms that use such mixed-media data sets in medicine and other domains could lead to substantially more accurate methods for learning from available historical data and to substantially more accurate decision-making aids for physicians and other decision makers.


Figure 2. Additional Examples of Time-Series Prediction Problems.
The figure shows three further problems with the same abstract structure as the pregnancy data of figure 1, each tracking one entity over time steps t0 through tn: customer purchase behavior (demographic features such as sex, age, income, and home ownership, plus software and computer owned, with the target feature "Purchase Excel?" known only at the final time step); customer retention (the same demographic features plus checking and savings balances, with the target feature "Current customer?"); and process optimization (process stage, mixing speed, temperature, fan speed, viscosity, fat content, density, and spectral peak, with the target feature "Product underweight?").


Learning using private data plus internet-accessible data: To illustrate this topic, consider a university interested in using historical data to identify current students who are at high risk of dropping out because of academic or other difficulties. The university might collect data on students over several years and apply a rule-learning method similar to that illustrated in figure 1. Such first-generation learning methods might work well, provided there is a sufficiently large set of training data and a data set that captures the features that are actually relevant to predicting student attrition. Now imagine a learning method that could augment the data collected for this study by also using the rich set of information available over the World Wide Web. For example, the web might contain much more complete information about the particular academic courses taken by the student, the extracurricular clubs they joined, their living quarters, and so on. The point here is that the internet can provide a vast source of data that goes far beyond what will be available in any given survey. If this internet information can be used effectively, it can provide important additional features that can improve the accuracy of learned rules. Of course, much of this web-based information appears in text form, and it contains a very large number of irrelevant facts in addition to the small number of relevant facts. Research on new methods that utilize the web and other public online sources (for example, news feeds, electronic discussion groups) to augment available private data sets holds the potential to significantly increase the accuracy of learning in many applications.




Learning to Perceive: Face Recognition

A second application niche for machine learning lies in constructing computer programs that are simply too difficult to program by hand. Many sensor-interpretation problems fall into this category. For example, today's top speech-recognition systems all use machine-learning methods to some degree to improve their accuracy. A second example is image-classification problems, such as the face-recognition task illustrated in figure 3. The face images shown there have been used to train neural networks to classify new face images according to the identity of the person, his/her gender, and the direction in which he/she is looking. These images are part of a larger collection containing 624 images of 20 different people (selected students from Carnegie Mellon University's machine-learning class). For each person, approximately 32 different grey-scale images were collected, varying the person's expression, the direction in which he/she is looking, and his/her appearance with or without sunglasses. Given these data, students in the machine-learning class were asked to train a computer program to correctly recognize the face in a new image. Students succeeded in training a neural network to achieve this task with an accuracy of 90 percent at recognizing these faces, compared to the 5-percent accuracy that would be obtained by randomly guessing which of the 20 people is in the image. In contrast, it is extremely difficult to manually construct a face-recognition program of comparable accuracy.

To illustrate the use of neural networks for image classification, consider the slightly simpler task of classifying these same images according to the direction in which the person is looking (for example, to his/her right or left, up, or straight ahead). This task can also be accomplished with approximately 90-percent accuracy by training the neural network shown at the top left of figure 3. The inputs to the neural network encode the image, and the outputs classify the input image according to the direction in which the person is looking. More specifically, the image is represented by a 30 × 32 coarse-grained grid of pixel values, and each of these 960 pixel values is used as a distinct input to the network. The first layer of the network shown in the figure contains 3 units, each of which has the same set of 960 inputs. The activation of each such unit is computed by calculating a linear-weighted sum of the 960 inputs, then passing the result through a sigmoid-squashing function that maps this linear sum into the interval [0,1]. The activations of these three units are then input to the final layer of four output units. As shown in the figure, each output unit represents one of the four possible face directions, with the highest-valued output taken as the network prediction. This network is trained using the back-propagation algorithm (Rumelhart and McClelland 1986) to adjust the weights of each network unit so that the difference between the network output and the known correct value for each training example is minimized.
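
A minimal NumPy sketch of this 960-input, 3-hidden-unit, 4-output sigmoid network follows. It is not the original course code; the learning rate, weight initialization, and squared-error loss are my assumptions, chosen to match the basic back-propagation algorithm described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Architecture from the article: 960 pixel inputs, 3 sigmoid hidden units,
# 4 sigmoid output units (one per face direction).
n_in, n_hid, n_out = 30 * 32, 3, 4
W1 = rng.normal(0, 0.05, (n_hid, n_in))   # hidden-layer weights
b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.05, (n_out, n_hid))  # output-layer weights
b2 = np.zeros(n_out)

def train_step(x, t, lr=0.3):
    """One stochastic gradient step on squared error (basic backprop)."""
    global W1, b1, W2, b2
    h = sigmoid(W1 @ x + b1)          # hidden activations
    y = sigmoid(W2 @ h + b2)          # output activations
    # Output deltas: derivative of squared error through the sigmoid.
    delta_out = (y - t) * y * (1 - y)
    # Hidden deltas: backpropagate the output deltas through W2.
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid

def predict(x):
    """The highest-valued output unit is the predicted face direction."""
    return int(np.argmax(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)))

# Here x would be a flattened 30x32 grey-scale image scaled to [0, 1],
# and t a one-hot target such as [1, 0, 0, 0] for "looking left".
```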

One intriguing property of such multilayer neural networks is that during training, the internal hidden units develop an internal representation of the input image that captures features relevant to the problem at hand. In a sense, the learning algorithm discovers a new representation of the input image that captures the most important features. We can examine this discovered representation in the current example by looking at the 960 input weights to each of the 3 internal units in the network of figure 3. Suppose we display the 960 weight values (one for weighting each input pixel) in a grid, with each weight displayed in the position of the corresponding pixel in the image grid. This is exactly the visualization of network weights shown to the right of the network in the figure. Here, large positive weights are displayed in bright white, large negative weights in dark black, and near-zero weights in grey. The topmost four blocks depict the weights for each of the four output units, and the bottom three large grids depict the weights for the three internal, or hidden, units.


To understand these weight diagrams, consider first the four blocks at the top right, which describe the weights for the four output units of the network. Let us look at the third block, which encodes the weights for the third output unit, the unit representing that the person is looking to his/her right. This block shows four weight values. The first (leftmost) of these is the weight that determines the unit's threshold, and the remaining three are the weights connecting the three hidden units to this output unit. Note that this "right" output unit has a large positive (bright-white) weight connecting to the middle hidden unit and a large negative (dark-black) weight connecting to the third hidden unit. Therefore, this unit will predict that the person is facing to his/her right if the second hidden unit has a high activation and the third hidden unit does not.

Now consider the second hidden unit, whose activation will cause the correct output unit to activate. The weights for this unit are depicted in the middle diagram of the large weight diagrams directly below the output unit diagrams. Which pixels in the input image is this middle unit most sensitive to? Note first that the largest weight values (bright white and dark black) for this unit are associated with pixels that cover the face or body of the person, and the smallest (light grey) weights correspond to the image background. Thus, the face classification will depend primarily on this section of the image. Second, notice the cluster of bright (positive) weight values near the left side of the "face" and the cluster of dark (negative) weight values near the right side. Thus, this particular unit will produce a large weighted sum when it is given an input image with bright pixels overlapping these positive weights and dark pixels overlapping these negative weights. One example of such an image is the third face image shown at the bottom of the figure, in which the person is looking to his/her right. This unit will therefore have a high output when given the image of a person looking to his/her right, whereas it will have a low value when given an image of a person looking to his/her left (for example, the first image shown in the figure).
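
A quick way to reproduce this kind of weight-image display is to reshape each hidden unit's 960 weights back into the 30 × 32 pixel grid. The sketch below uses matplotlib; the random weight matrix is only a stand-in for a trained network, not the article's actual weights.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a trained weight matrix: 3 hidden units x 960 pixel weights.
W1 = np.random.default_rng(0).normal(size=(3, 30 * 32))

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for i, ax in enumerate(axes):
    # Reshape the unit's weight vector into the 30x32 pixel grid; a grey
    # colormap renders large positive weights white, large negative black.
    ax.imshow(W1[i].reshape(30, 32), cmap="gray")
    ax.set_title(f"hidden unit {i + 1}")
    ax.axis("off")
plt.show()
```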


Figure 3. Face Recognition.
[The figure shows typical input images, the 30 × 32 grid of network inputs feeding the network, and grids of learned weights for the four directions: left, straight, right, and up.] A neural network (top left) is trained to recognize whether the person is looking to his/her left or right or straight ahead or up. Each of the 30 × 32 pixels in the coarse-grained image is a distinct input to the network. After training on 260 such images, the network achieves an accuracy of 90 percent classifying future images. Similar accuracy is obtained when training to recognize the individual person. The diagrams on the top right depict the weight values of the learned network. These data and the learning code can be obtained at www.cs.cmu.edu/~tom/faces.html.


Neural network–learning algorithms such as the back-propagation algorithm given here are used in many sensor-interpretation problems. If you would like to experiment for yourself with this face-recognition data and learning algorithm, the data, code, and documentation are available at www.cs.cmu.edu/~tom/faces.html.

What is the state of the art in applying machine learning to such perception problems? In many perception problems, learning methods have been applied successfully and outperform nonlearning approaches (for example, in many image-classification tasks and other signal-interpretation problems). Still, there remain substantial limits to the current generation of learning methods. For example, the straightforward application of back propagation illustrated in figure 3 depends on the face occurring reliably in the same portion of the image. This network cannot easily learn to recognize faces independent of translation, rotation, or scaling of the face within the image. For an example of how such neural network face-recognition systems can be extended to accommodate translation and scaling, see Rowley, Baluja, and Kanade (1996).

There is a great deal of active research on learning in perception and on neural network learning in general. One active area of new research involves developing computational models that more closely fit what is known about biological perceptual systems. Another active area involves attempts to unify neural network–learning methods with Bayesian learning methods (for example, learning of Bayesian belief networks). One of my favorite areas for new thesis topics is long-term learning of new representations. Because of the ability of neural networks to learn hidden-layer representations, as illustrated in figure 3, they form one good starting point for new research on long-term learning of representations.

A Self-Customizing News Reader

A third niche for machine learning is computer software that automatically customizes to its users after it has been fielded. One prototypical example is the NEWS WEEDER system (Lang 1995), an interface to electronic news groups that learns the reading interests of its users. Here, we briefly summarize NEWS WEEDER and its learning method.

The purpose of NEWS WEEDER is to learn to filter the flood of online news articles on behalf of its user. Each time the user reads an article using NEWS WEEDER, he/she can assign it a rating from 1 (very interesting) to 5 (uninteresting) to indicate interest in the article. NEWS WEEDER uses these training examples to learn a general model of its user's interests. It can then use this learned interest model to automatically examine new articles, or articles from other news groups, to assemble a "top 20" list for the user. Given that there are thousands of news groups available online and that most users read only a few, the potential benefit is that the agent can perform the task of sorting through the vast bulk of uninteresting articles to bring the few interesting ones to the attention of its user.


Table 1. Results of NEWS WEEDER for One User.

True Rating    Predicted Rating
               1    2    3    4    5    Skip    Total
1:             0    1    0    0    0    1       2
2:             1    15   6    4    0    15      41
3:             0    6    31   20   0    15      72
4:             0    6    8    42   0    20      76
5:             0    0    0    4    0    1       5
skip:          0    8    4    5    1    141     159

After observing the user read approximately 1300 news articles, the system predicted ratings for approximately 600 new articles. The confusion matrix above shows the relationship between NEWS WEEDER's predictions and the actual user ratings. For example, the 20 in the table indicates there were 20 articles for which NEWS WEEDER predicted a rating of 4 but that the user rated a 3. The large numbers along the diagonal indicate correct predictions.


The key machine-learning question underlying NEWS WEEDER is, How can we devise algorithms capable of learning a general model of user interests from a collection of specific text documents? NEWS WEEDER takes an approach that is based both on earlier work in information retrieval and on Bayesian learning methods. First, each text article is re-represented by a very long feature vector, with one feature for each possible word that can occur in the text. Given that there are approximately 50,000 words in standard English, this feature vector contains approximately 50,000 features. The value for each feature (for example, the feature corresponding to the word windsurf) is the number of times the word occurs in the document. This feature-vector representation, based on earlier work in information retrieval, is sometimes called a bag-of-words representation for the document because it omits information about word sequence, retaining only information about word frequency.
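
In code, a bag-of-words representation is just a word-frequency map. Here is a minimal sketch; the whitespace tokenizer is a simplification of whatever tokenization NEWS WEEDER actually used.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map each word to the number of times it occurs, discarding order."""
    return Counter(text.lower().split())

print(bag_of_words("To windsurf or not to windsurf"))
# Counter({'to': 2, 'windsurf': 2, 'or': 1, 'not': 1})
```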

Once the document is represented by such a feature vector, NEWS WEEDER applies a variant of a naive Bayes classification learning method to learn to classify documents according to their interestingness rating. In the naive Bayes approach, the system assigns a rating to each new article based on several probability terms that it estimates from the training data, as described in the following paragraph.

During training, the naive Bayes classifier determines several probabilities that are used to classify new articles. First, it determines the prior probability P(Ci) for each possible rating Ci by calculating the frequency with which this rating occurs over the training data. For example, if the user assigned a top rating of 1 to just 9 of the 100 articles he/she read, then P(Rating = 1) = 0.09. Second, it estimates the conditional probability P(wj | Ci) of encountering each possible word wj given that the document has a true rating of Ci. For example, if the word windsurf occurs 11 times among the 10,000 words appearing in articles rated 1 by the user, then P(windsurf | Rating = 1) = 0.0011. A new text document d is then assigned a rating by choosing the rating Ci that maximizes

P(Ci) Πwj∈d P(wj | Ci)

where Πwj∈d P(wj | Ci) represents the product of the various P(wj | Ci) over all words wj found in the document d. As we might expect, the higher the conditional probabilities P(wj | Ci) for the words wj that are actually encountered in d, the greater the chance that Ci is the correct rating for d.

In fact, NEWS WEEDER's learning method uses a variant of this algorithm, described in Lang (1995). One difference is that the document length is taken into account in estimating the probability of word occurrence. A second is that the system combines the strength of its predictions for each rating level, using a linear combination of these predictions to produce its final rating.
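
The paragraphs above describe the standard naive Bayes text classifier on which NEWS WEEDER's variant is built; below is a minimal sketch of that base algorithm. The add-one smoothing (so unseen words do not zero out the product) and the use of log-probabilities (to avoid floating-point underflow) are my assumptions for a workable implementation, not details taken from Lang's paper.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesRater:
    def fit(self, docs: list[list[str]], ratings: list[int]) -> None:
        self.labels = sorted(set(ratings))
        self.prior = Counter(ratings)              # class counts
        self.word_counts = defaultdict(Counter)    # rating -> word counts
        for words, r in zip(docs, ratings):
            self.word_counts[r].update(words)
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, words: list[str]) -> int:
        def log_score(r: int) -> float:
            total = sum(self.word_counts[r].values())
            # log P(Ci) + sum over words of log P(wj | Ci), add-one smoothed
            s = math.log(self.prior[r] / sum(self.prior.values()))
            for w in words:
                s += math.log((self.word_counts[r][w] + 1)
                              / (total + len(self.vocab)))
            return s
        return max(self.labels, key=log_score)

# Usage: rater = NaiveBayesRater(); rater.fit(tokenized_articles, user_ratings)
# then rater.predict("windsurfing in hawaii".split()) returns a rating.
```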


How well does NEWS WEEDER learn the user's interests? In one set of experiments, the system was trained for one user using a set of approximately 1900 articles. For each article the user read, a rating of 1 to 5 was assigned. Articles that the user chose not to read were labeled skip. A subset of approximately 1300 of these articles was used to train the system, and its performance was then evaluated over the remaining articles. NEWS WEEDER's predicted ratings are compared to the actual user ratings in table 1. Note that numbers along the diagonal of this confusion matrix indicate cases where the predicted and true ratings were identical, whereas off-diagonal entries indicate differences between the two. As is apparent from table 1, NEWS WEEDER does a fairly good job of predicting user interest in subsequent documents. Note that in many cases, its errors represent small differences between the predicted and actual ratings (for example, many of its errors when predicting a rating of 4 were articles that were actually rated 3 by the user).
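
Summary statistics are easy to read off such a confusion matrix. For instance, this fragment (its row and column layout follows table 1) computes the fraction of exact-match predictions:

```python
# Rows = true rating (1-5, then skip); columns = predicted (1-5, then skip).
confusion = [
    [0, 1,  0,  0,  0, 1],
    [1, 15, 6,  4,  0, 15],
    [0, 6,  31, 20, 0, 15],
    [0, 6,  8,  42, 0, 20],
    [0, 0,  0,  4,  0, 1],
    [0, 8,  4,  5,  1, 141],
]
total = sum(sum(row) for row in confusion)
exact = sum(confusion[i][i] for i in range(6))
print(f"exact-match accuracy: {exact}/{total} = {exact / total:.2f}")
# With table 1's counts, this prints 229/355 = 0.65.
```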

NEWS WEEDER illustrates the potential role of machine learning in automatically customizing software to users' needs. It also illustrates how machine learning can be applied to the growing volume of text data found on electronic news feeds, the World Wide Web, and other online sources. Because the initial experiments with NEWS WEEDER were promising, a second-generation system called WISE WIRE has now been developed at WiseWire Corporation. This system is available free of charge at www.wisewire.com.

The current state of the art for learning to classify text documents is fairly well represented by the naive Bayes text classifier described here. An implementation of this algorithm for text classification can be obtained at www.cs.cmu.edu/~tom/ml-examples.html. Machine learning over text documents is an active area of current research. One interesting research topic involves going beyond the bag-of-words approach, representing the text by more linguistic features such as noun phrases and partial interpretations of the meaning of the text. In addition to classifying text documents, other important text-learning problems include learning to extract information from text (for example, learning to extract the title, speaker, room, and subject, given online seminar announcements). Given the rapidly growing importance of the World Wide Web and other online text such as news feeds, it seems likely that machine learning over text will become a major research thrust over the coming years.



Where Next?

As the previous three examples illustrate, machine learning is already beginning to have a significant impact in several niches within computer science. My personal belief is that the impact of machine learning is certain to grow over the coming years, as more and more data come online, we develop more effective algorithms and underlying theories for machine learning, the futility of hand coding increasingly complex systems becomes apparent, and human organizations themselves learn the value of capturing and using historical data.

Put in a long-term historical perspective, computer science is a young field still in its first 50 years. Perhaps we will look back after the first hundred years to see that the evolution of computer science followed a predictable course: During its first few decades, computer science focused on fairly simple algorithms and computer programs and developed a basic understanding of data structures and computation. During the next several decades, it struggled with the problem of how to construct increasingly complex software systems but continued to follow its early approach of manually hand crafting each line of code. During the second half of its first century (hopefully!), a shift is seen to take place in which software developers eventually abandon the goal of manually hand crafting increasingly complex software systems. Instead, they shift to a software development methodology of manually designing the global structure of the system but using machine-learning methods to automatically train individual software components that cannot be found in a software library and to optimize global performance. Of course, we will have to wait to find out how the second 50 years pans out, but at least for the three niches discussed here, it is already the case that machine learning competes favorably with manually hand-crafted code.

As for us researchers, many exciting opportunities are available right now regardless of how the next 50 years pan out. In addition to some of the specific research directions noted earlier, I believe there are several areas where significant breakthroughs are possible. Below are some of my speculations on which research areas might produce dramatic improvements in the state of the art of machine learning.

Incorporation of prior knowledge with training data: One striking feature of the learning methods described earlier is that they are all tabula rasa methods; that is, they assume the learner has no initial knowledge except for the representation of hypotheses and that it must learn solely from large quantities of training data. It is well understood that prior knowledge can reduce sample complexity and improve learning accuracy. In fact, machine-learning algorithms such as explanation-based learning (DeJong 1997) and, more recently, Bayesian belief networks provide mechanisms for combining prior knowledge with training data to learn more effectively. Nevertheless, it remains a fact that almost all the successful uses of machine learning to date rely only on tabula rasa approaches. More research is needed to develop effective methods of combining prior knowledge with data to improve the effectiveness of learning.

Lifelong learning: Most current learning systems address just one task. However, humans learn many tasks over a long lifetime and seem to use the accumulating knowledge to guide subsequent learning. One interesting direction in current research is to develop long-life agents that accumulate knowledge and reuse it to guide learning in a similar fashion. These agents might learn new representations, as well as facts, so that they can more effectively represent and reason about their environment as they age. Several researchers are already examining this topic in the context of robot learning and autonomous software agents.

Machine learning embedded in programming languages: If one assumes that machine learning will play a useful and routine role within certain types of computer software, then it is interesting to ask how future software developers will incorporate learning into their software. Perhaps it is time for the development of new programming languages or programming environments that allow developers to incorporate machine learning into their software as easily as current environments allow incorporating abstract data types and other standard software constructs.
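
As one purely illustrative reading of this idea (my speculation, not a proposal from the article), a language-level construct for a learned component might look like an ordinary typed interface whose implementation is trained rather than written:

```python
from typing import Protocol

class LearnedComponent(Protocol):
    """A software component whose behavior is trained, not hand coded."""
    def fit(self, examples: list[tuple[object, object]]) -> None: ...
    def __call__(self, x: object) -> object: ...

def build_system(sort_mail: LearnedComponent) -> None:
    # The developer designs the global structure of the system; the
    # component's input-output behavior is filled in by training data.
    sort_mail.fit([("meeting at 3pm", "calendar"), ("50% off!", "spam")])
    print(sort_mail("lunch tomorrow?"))
```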

Machine learning for natural language: One fact that seems certain to influence machine learning is the fact that the majority of the world's online data is now text. Given approximately 200,000,000 pages on the World Wide Web, and approximately 1,000,000,000 hyperlinks, it might be time to consider machine learning for natural language problems. Consider that each of these billion hyperlinks is a kind of semisupervised training example that indicates the meaning of the words underlined by the hyperlink. For example, the meaning of a hyperlink such as my favorite vacation spot is indicated by the web page to which it points. Given that there are only 50,000 words in everyday English, the 1,000,000,000 hyperlinks produce, on average, 20,000 training examples of each word. If we could invent learning algorithms that use this kind of training data, we might be able to make significant new progress on natural language processing.

Acknowledgments

The research described here is the result of the efforts of a number of people and organizations. Schenley Park Research, Inc., provided the medical data-mining example. Jeff Shufelt, Matt Glickman, and Joseph O'Sullivan helped create the code and collect the data for the face-recognition example. This work has been supported in part by the Defense Advanced Research Projects Agency under research grant F33615-93-1-1330.

Parts of this article are excerpted from the textbook Machine Learning (Mitchell 1997).

References

Aha, D.; Kibler, D.; and Albert, M. 1991. Instance-Based Learning Algorithms. Machine Learning 6: 37–66.

Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence (Special Issue on Computational Research on Interaction and Agency) 72(1): 81–138.

Bishop, C. M. 1996. Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford University Press.

Caruana, R.; Baluja, S.; and Mitchell, T. 1995. Using the Future to Sort Out the Present: RANKPROP and Multitask Learning for Medical Risk Analysis. In Advances in Neural Information Processing Systems 7, eds. G. Tesauro, D. S. Touretzky, and T. K. Leen. Cambridge, Mass.: MIT Press.

Chauvin, Y., and Rumelhart, D. 1995. Backpropagation: Theory, Architectures, and Applications. Hillsdale, N.J.: Lawrence Erlbaum.

Clark, P., and Niblett, T. 1989. The CN2 Induction Algorithm. Machine Learning 3: 261–284.

DeJong, G. 1997. Explanation-Based Learning. In The Computer Science and Engineering Handbook, ed. A. Tucker, 499–520. Boca Raton, Fla.: CRC Press.

Forrest, S. 1993. Genetic Algorithms: Principles of Natural Selection Applied to Computation. Science 261: 872–878.

Hecht-Nielsen, R. 1990. Neurocomputing. Reading, Mass.: Addison Wesley.

Joachims, T.; Freitag, D.; and Mitchell, T. 1997. WEBWATCHER: A Tour Guide for the World Wide Web. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence. Forthcoming.

Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement Learning: A Survey. Journal of AI Research 4: 237–285. Available at www.cs.washington.edu/research/jair/home.html.

Kearns, M. J., and Vazirani, U. V. 1994. An Introduction to Computational Learning Theory. Cambridge, Mass.: MIT Press.

Lang, K. 1995. NEWS WEEDER: Learning to Filter NETNEWS. In Proceedings of the Twelfth International Conference on Machine Learning, eds. A. Prieditis and S. Russell, 331–339. San Francisco, Calif.: Morgan Kaufmann.

Lavrac, N., and Dzeroski, S. 1994. Inductive Logic Programming: Techniques and Applications. Ellis Horwood.

Mitchell, M. 1996. An Introduction to Genetic Algorithms. Cambridge, Mass.: MIT Press.

Mitchell, T. M. 1997. Machine Learning. New York: McGraw Hill.

Mitchell, T. M.; Caruana, R.; Freitag, D.; McDermott, J.; and Zabowski, D. 1994. Experience with a Learning Personal Assistant. Communications of the ACM 37(7): 81–91.

Quinlan, J. R. 1990. Learning Logical Definitions from Relations. Machine Learning 5: 239–266.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.

Rowley, H.; Baluja, S.; and Kanade, T. 1996. Human Face Detection in Visual Scenes. In Advances in Neural Information Processing Systems 8, eds. D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo. Cambridge, Mass.: MIT Press.

Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volumes 1 and 2. Cambridge, Mass.: MIT Press.

Tom Mitchell is currently professor of computer science and robotics at Carnegie Mellon University (CMU) and director of the CMU Center for Automated Learning and Discovery. He recently completed a comprehensive textbook entitled Machine Learning (McGraw Hill, 1997). Mitchell's current research involves applying machine learning to the vast store of undisciplined information found on the World Wide Web and other internet sources. His e-mail address is [email protected].