
Timeweaver: a Genetic Algorithm for Identifying Predictive Patterns in Sequences of Events

Gary M. Weiss
Rutgers University and AT&T Labs

101 JFK Parkway
Short Hills, NJ 07078

    Abstract

Learning to predict future events from sequences of past events is an important, real-world problem that arises in many contexts. This paper describes Timeweaver, a genetic-based machine learning system that solves the event prediction problem by identifying predictive temporal and sequential patterns within data. Timeweaver is applied to the task of learning to predict telecommunication equipment failures from 250,000 alarm messages and is shown to outperform existing methods.

1 INTRODUCTION

Data is being generated and stored at an ever-increasing pace, and, partly as a consequence of this, there has been increased interest in how machine learning and statistical techniques can be employed to extract useful knowledge from this data. When this data is time-series data, it is often important to be able to predict future behavior based on past data. In this paper we are interested in the problem of predicting specific types of rare future events, which we refer to as target events, from sequences of timestamped events. We restrict ourselves to domains where the events are described by categorical (i.e., non-numerical) features, since statistical methods already exist that can solve time-series prediction problems with numerical features. We call the class of problems we address in this paper rare event prediction problems. For these prediction problems, every time an event is received, a prediction procedure is applied which, based on the past events, determines whether the target event will occur in the near future. The problem of predicting telecommunication equipment failures from logs of alarm messages is one example of this type of prediction problem. In this case, prediction of a failure might cause one to replace, or at least route phone traffic around, the suspect piece of equipment. Other examples of rare event prediction problems include predicting fraudulent credit card transactions and the start of transcription in DNA sequences.

Machine learning and statistical methods have been used to solve problems similar to the rare event prediction problem, but most of these methods are not applicable to this class of problems. The statistical methods do not apply because they require numerical features (Brockwell & Davis, 1996). The many machine learning methods that perform classification do not apply because they assume unordered examples, not time-ordered events. Thus, these methods cannot learn from sequential or temporal relationships between events. The machine learning methods that are useful for modeling sequences are also not appropriate, since we do not need to model the entire sequence; we only need to predict one specific type of event within a window of time.

Our approach to solving the event prediction problem involves using a genetic algorithm to directly search for predictive patterns in the data. Our system will learn a set of rules of the form pattern → target event, and hence may be considered a classifier system. However, because our system does not provide any form of internal memory or chaining of rules, and because the rules all have a common right-hand side, it is more appropriate to view our system as a genetic-based machine learning system. We feel our work shares much in common with other such systems, most notably COGIN (Greene & Smith, 1993) and GABIL (De Jong, Spears & Gordon, 1993). The event prediction problem has been described in an earlier paper, but only a very brief description of the genetic algorithm was provided (Weiss & Hirsh, 1998). In this paper we provide a detailed description of Timeweaver, our genetic-based machine learning system that solves rare event prediction problems by identifying predictive patterns in the data.

The event prediction problem has some interesting characteristics that affect the design of our genetic algorithm. First, because we expect to encounter very large datasets, the GA must minimize the number of patterns that it evaluates. Second, because the events we are interested in predicting are typically rare and difficult to predict, predictive accuracy is not an appropriate fitness measure: the strategy of never predicting any target events would often maximize predictive accuracy. To avoid this problem, we base our fitness function on recall and precision, two measures from the information retrieval literature. Recall will measure the percentage of target events that are successfully predicted and precision the percentage of predictions that are correct. By factoring in recall, we can ensure that a prediction strategy is developed that predicts the majority of the target events.

2 THE EVENT PREDICTION PROBLEM

In this section we provide a formal description of the event prediction problem. An event E_t is a timestamped observation occurring at time t that is described by a set of feature-value pairs. An event sequence is a time-ordered sequence of events, S = E_t1, E_t2, ..., E_tn. The target event is the specific type of event we would like to learn to predict. The event prediction problem is to learn a prediction procedure that, given a sequence of timestamped events, correctly predicts whether the target event will occur in the near future. In this paper we assume the prediction procedure involves matching a set of learned patterns against the data and predicting the occurrence of a target event if the match succeeds. We say a prediction occurring at time t, P_t, is correct if a target event occurs within its prediction period. As shown below, the prediction period is defined by a warning time, W, and a monitoring time, M.

The warning time is the lead time necessary for a prediction to be useful and the monitoring time determines how far into the future the prediction extends. While the value of the warning time is critically dependent on the problem domain (e.g., how long it takes to replace a piece of equipment), there is generally a lot of flexibility in assigning the monitoring time. The larger the value of the monitoring time, the easier the prediction problem, but the less meaningful the prediction. A target event is correctly predicted if at least one prediction is made within its prediction period. The prediction period of target event X, occurring at time t, is shown below:

[Figure 1: Prediction periods. A prediction P_t made at time t is correct if a target event occurs within its prediction period, which extends from t + W to t + M. Equivalently, a target event X occurring at time t has a prediction period extending from t - M to t - W.]
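To make the correctness criterion concrete, the following minimal Python sketch (our illustration, not code from the paper) checks whether a target event is correctly predicted given a warning time W and a monitoring time M; all names are ours.

def correctly_predicted(target_time, prediction_times, W, M):
    # A target event at target_time is correctly predicted if some
    # prediction falls within its prediction period [t - M, t - W].
    return any(target_time - M <= p <= target_time - W
               for p in prediction_times)

# Example with the Section 6.1 settings: W = 20 seconds, M = 8 hours
W, M = 20, 8 * 3600
print(correctly_predicted(5500.0, [1000.0, 5000.0], W, M))  # True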

3 THE GENERAL APPROACH

Our approach to solving the event prediction problem involves two steps. In the first step, a genetic algorithm is used to search the space of prediction patterns, in order to identify a set of patterns that individually do well at predicting a subset of the target events and collectively predict most of the target events. This is a Michigan-style GA, since each individual pattern is only part of a complete solution. Other GA-based systems have used this approach to learn disjunctive concepts from examples (Giordana, Saitta & Zini, 1994; Greene & Smith, 1993; McCallum & Spackman, 1990). However, rather than simply forming a solution from all of the individuals in the population, we employ a second step. This step orders the patterns from best to worst, based primarily on the precision of their predictions, and prunes redundant patterns (i.e., those patterns that do not predict any target events not already predicted by some better pattern). A family of prediction strategies is then formed by incrementally adding one pattern at a time, and a precision/recall curve is generated based on these strategies. A user can then use this curve, and the relative cost of false predictions versus missed (i.e., not predicted) target events, to determine the optimal prediction strategy.

Our GA uses steady-state reproduction, where only a few individuals are replaced at once, rather than generational reproduction, where a significant percentage of the population is replaced. We chose to use steady-state reproduction because it makes newly created members immediately available for use, whereas with generational reproduction, the new individuals cannot be used until the next generation (Syswerda, 1990). Given that large datasets will make it computationally expensive to evaluate each new pattern, we believe it is very important that new individuals are made immediately available.

4 THE GENETIC ALGORITHM

This section describes a genetic algorithm that searches for patterns that successfully predict the future occurrence of target events. The basic steps in our steady-state GA are shown below:

1. Initialize population
2. while stopping criteria not met
3.     select 2 individuals from the population
4.     apply the crossover operator with probability PC and the mutation operator with probability PM
5.     evaluate the 2 newly formed individuals
6.     replace 2 existing individuals with the new ones
7. done
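The following Python sketch illustrates this steady-state scheme; the operator probabilities and the callables (fitness, crossover, mutate, select, replace_two) are assumptions of ours, not Timeweaver's actual implementation.

import random

def steady_state_ga(population, fitness, crossover, mutate, select,
                    replace_two, iterations, pc=0.7, pm=0.1):
    # Minimal steady-state loop: two children are created, evaluated,
    # and inserted each iteration, so they are immediately usable.
    evaluated = [(ind, fitness(ind)) for ind in population]
    for _ in range(iterations):
        p1 = select(evaluated)  # an individual, e.g., drawn proportional
        p2 = select(evaluated)  # to shared fitness (Section 4.5)
        c1, c2 = crossover(p1, p2) if random.random() < pc else (p1, p2)
        if random.random() < pm:
            c1, c2 = mutate(c1), mutate(c2)
        # evaluate only the two newly formed individuals (step 5)
        evaluated = replace_two(evaluated, [(c1, fitness(c1)),
                                            (c2, fitness(c2))])
    return evaluated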

Since we are using a Michigan-style GA, the performance of the population cannot be accurately estimated based on the performance of individual members of the population. Evaluating the performance based on all of the patterns would also be misleading, since there may be some very poor rules in the population. To accurately estimate the performance of the GA, we apply our second step every 250 iterations of the GA. This forms a complete prediction strategy using the best patterns in the population (as described in detail in Section 5).

The remainder of this section describes the most important aspects of the GA. In particular, much attention is devoted to the fitness function and diversity maintenance strategy, since these account for much of the complexity of our system. We use a niching strategy called sharing to ensure that diversity is preserved and that the prediction patterns in the population cover a majority of the target events. The niching strategy results in a new measure, shared fitness, which factors in both fitness and diversity considerations. It is this shared fitness measure that is used to implement the selection and replacement strategies.


In particular, individuals are selected proportional to their shared fitness and removed inversely proportional to their shared fitness.

4.1 REPRESENTATION

Each individual in our GA is a prediction pattern: a pattern used to predict target events. The language used to describe these patterns is similar to the language used to represent the raw event sequences. A prediction pattern is a sequence of events in which consecutive events are connected by an ordering primitive that defines ordering constraints between these events. The following ordering primitives are supported:

• the wildcard * primitive matches any number of events, so the prediction pattern A*D matches ABCD

• the next . primitive matches no events, so the prediction pattern A.B.C only matches ABC

• the parallel | primitive allows events to occur in any order and is commutative, so the prediction pattern A|B|C will match, amongst others, CBA.

The | primitive has highest precedence, so the pattern A.B*C|D|E matches an A, followed immediately by a B, followed sometime later by a C, D and E, in any order. Each feature in an event is permitted to take on any of a predefined list of valid feature values, as well as the wildcard (?) value, which matches any feature value. For example, if events in a domain are described by three features, then the event
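To illustrate these matching semantics (this is our sketch, not Timeweaver's code), the following Python interprets a pattern already parsed into groups: events joined by | form an unordered group (since | binds tightest), groups are joined by . (strictly next) or * (any gap), and ? is the wildcard value; all names are illustrative.

from itertools import permutations

def event_matches(pat, ev):
    # '?' is the wildcard feature value and matches any event
    return pat == "?" or pat == ev

def match_from(groups, connectors, seq, i):
    # Match the remaining groups against seq starting at position i.
    if not groups:
        return True
    group = groups[0]
    k = len(group)
    if i + k > len(seq):
        return False
    # a |-group matches k consecutive events in any order (commutative)
    window = seq[i:i + k]
    if not any(all(event_matches(p, e) for p, e in zip(perm, window))
               for perm in permutations(group)):
        return False
    if len(groups) == 1:
        return True
    if connectors[0] == ".":   # next: no intervening events allowed
        return match_from(groups[1:], connectors[1:], seq, i + k)
    else:                      # '*': any number of intervening events
        return any(match_from(groups[1:], connectors[1:], seq, m)
                   for m in range(i + k, len(seq)))

def pattern_matches(groups, connectors, seq):
    # the pattern may begin anywhere in the event sequence
    return any(match_from(groups, connectors, seq, i)
               for i in range(len(seq)))

# A.B*C|D|E: an A, immediately a B, then later C, D and E in any order
groups = [["A"], ["B"], ["C", "D", "E"]]
connectors = [".", "*"]
print(pattern_matches(groups, connectors, list("ABXXEDC")))  # True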

child. The additional value is chosen as follows: if d is the smaller pattern duration of P1 and P2, then a value randomly chosen between 0 and d is evaluated; otherwise a random value between d and 2d is evaluated. This procedure allows new pattern duration values to be introduced.

The mutation operator modifies the value of either the pattern duration, the feature values in the events, or the ordering primitives. In each case, mutation causes a valid random value to be selected. Note that this allows more general or more specific patterns to be formed. For example, when a feature value is mutated, 50% of the time it will be set to the wildcard value and 50% of the time to a randomly selected feature value. Mutation of a pattern duration with value d results in a random value between 0 and 2d being generated.
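A minimal sketch of the two numeric mutation rules just described, assuming each feature has a predefined list of valid values (names are ours):

import random

def mutate_feature_value(valid_values):
    # generalize to the wildcard half the time, otherwise specialize
    if random.random() < 0.5:
        return "?"
    return random.choice(valid_values)

def mutate_pattern_duration(d):
    # a new duration is drawn uniformly from [0, 2d]
    return random.uniform(0, 2 * d)

print(mutate_feature_value(["interrupt", "alarm", "reset"]))
print(mutate_pattern_duration(351))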

4.4 THE FITNESS FUNCTION

The fitness function must take an individual pattern and return a value indicating how good the pattern is at predicting target events. As mentioned earlier, predictive accuracy is not a good measure of fitness, since the target events may occur very infrequently and because we are using the Michigan approach, where each pattern is only part of the total solution. Consequently, recall and precision form the basis of our fitness function. These information retrieval measures, which are described in detail below, are appropriate for measuring the fitness of individual patterns or collections of patterns. These measures are summarized in Figure 2. For our problem, recall is the fraction of the target events that are successfully predicted and precision is the fraction of the (positive) predictions that are correct. The negative predictions are not factored in because they are always ignored: only the presence or absence of a positive prediction is used to determine whether to predict the future occurrence (or non-occurrence) of a target event.

As described in Section 2, a target event may be predicted whenever an event is received; thus, multiple valid predictions of a single target event may occur. Precision is a misleading measure since it counts these multiple predictions multiple times. Since the user is expected to take action upon seeing the first positive prediction, counting the subsequent predictions in the precision measure is improper. Normalized precision eliminates this multiple counting by replacing the number of correct (positive) predictions with the number of target events correctly predicted.

Normalized precision still does not account for the fact that n incorrect positive predictions located close together may not be as harmful as the same number spread out over time (depending on the nature of the actions taken in response to the prediction of a target event). We use reduced precision to remedy this. A prediction is considered active for a period equal to its monitoring time, since the target event should occur somewhere during that period. Two false predictions located close together will have overlapping active periods. The reduced precision measure replaces the number of false positive predictions with the number of complete, non-overlapping prediction periods associated with a false prediction. Thus, two false predictions occurring a half-monitoring period apart yield 1½ discounted false positives. For the remainder of this paper, the term precision will refer to reduced precision.
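The sketch below shows our reading of the discounting rule (not code from the paper): a false prediction is active for one monitoring time M, and overlapping predictions contribute only the fraction of their period that extends past the current active period.

def discounted_false_positives(fp_times, M):
    # Count non-overlapping monitoring periods covered by false
    # predictions; an overlapping prediction adds only its
    # non-overlapping fractional extension.
    total, active_end = 0.0, None
    for t in sorted(fp_times):
        if active_end is None or t >= active_end:
            total += 1.0
        else:
            total += (t + M - active_end) / M
        active_end = t + M
    return total

# two false predictions half a monitoring period apart -> 1.5
print(discounted_false_positives([0.0, 50.0], M=100.0))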

Recall = (# Target Events Predicted) / (Total Target Events)

Precision = TP / (TP + FP)

Normalized Precision = (# Target Events Predicted) / (# Target Events Predicted + FP)

Reduced Precision = (# Target Events Predicted) / (# Target Events Predicted + Discounted FP)

TP = True Positive Prediction, FP = False Positive Prediction

Figure 2: Evaluation Measures for Event Prediction

The fitness of a pattern is based on both its precision and recall. However, there are many ways in which these two measures can be combined to form the fitness function. Extensive testing was done on synthetic data to evaluate alternative strategies for combining these two measures. Different fixed weighting schemes were tried, but all yielded poor results. Typically, the resulting population of patterns contained either very precise patterns that each covered only a few target events or very imprecise patterns that covered many of the target events. That is, the GA tended to optimize for precision or recall, but not both, no matter how the relative weighting of these two measures was adjusted. Even the strategy of progressively increasing the relative importance of precision did not eliminate this problem. These fitness functions even performed poorly on synthetic data where a single pattern existed that yielded 100% precision and 100% recall. Thus, it is clear that these fitness functions prevent us from effectively searching the search space.

The solution we adopted is to modify the relative importance of precision and recall after each iteration of the GA. Specifically, we use an information retrieval measure known as the F-measure, which is defined below in equation 1 (Van Rijsbergen, 1979). The value of β, which controls the relative importance of precision to recall, was changed each iteration of the GA, so that it cycles through the values between 0 and 1 using a step size of 0.10.

fitness = ((β² + 1) × precision × recall) / (β² × precision + recall)    (1)
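Equation (1) transcribes directly into code; the exact cycling schedule for β below is our assumption from the description above.

def fitness(precision, recall, beta):
    # F-measure of equation (1); beta weights precision versus recall
    denom = beta * beta * precision + recall
    if denom == 0:
        return 0.0
    return (beta * beta + 1) * precision * recall / denom

def beta_for_iteration(iteration, step=0.10):
    # cycle beta through 0.0, 0.1, ..., 1.0, one value per GA iteration
    values = 11  # number of steps between 0 and 1 inclusive
    return (iteration % values) * step

print(fitness(0.9, 0.1, beta_for_iteration(5)))  # beta = 0.5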

4.5 DIVERSITY MAINTENANCE STRATEGY

The diversity maintenance strategy must ensure that a few prediction patterns do not dominate the population and that collectively the individuals in the population predict most, if not all, of the target events in the training set. Our challenge is to maintain diversity without making the search too unfocused, and to assess diversity efficiently, using a minimal amount of global information that can be efficiently computed. We use a niching strategy called sharing to maintain a diverse population (Goldberg, 1989). Diversity is encouraged by selecting individuals proportional to their shared fitness, a measure that factors in an individual's fitness as well as its similarity to other individuals in the population. The degree of similarity of an individual i to the n individuals comprising the population is measured by the niche count, defined in equation 2. Experiments using synthetic data led us to choose a value of 3 for α.

niche count_i = Σ_{j=1..n} (1 - distance(i, j))^α    (2)

The similarity of two individuals is measured using a phenotypic distance measure that measures distance based on the performance of the individuals at predicting target events. The performance of each individual is represented by a Boolean prediction vector of length t that indicates which of the t target events in the training set the individual correctly predicts. The distance between two individuals is the fraction of the target events for which the two individuals have different predictions, which can be computed from the number of bit differences in the prediction vectors. The more similar an individual is to the rest of the individuals in the population, the smaller the distances and the greater the niche count value; if an individual is identical to every other individual in the population, then the niche count will be equal to the population size. The shared fitness is defined as the fitness divided by the niche count.

Note that the distance measure focuses on which target events are successfully predicted; false predictions are not represented in the prediction vectors. This is not a major concern because false predictions are already penalized in the fitness function. Our approach exploits the fact that for many applications the target events are rare, which greatly reduces the time required to compute the distance measures.

We can calculate the number of bit-wise prediction vector comparisons required in order to keep the niche counts current. Since two new individuals are introduced into the population each iteration, we must calculate the niche counts of these two new individuals from scratch each iteration, and update the niche counts of the remaining individuals, due to the changes in the population. Assuming there are P individuals in the population, the niche count for each of the two new individuals requires P-1 prediction vector comparisons, since each pattern must be compared to all other patterns. The niche counts of the remaining individuals can be incrementally updated by considering, for each individual, the number of new bit differences added/eliminated as a result of replacing two old individuals with two new individuals. This incremental update requires 4 prediction vector comparisons per individual. Thus, a total of 2(P-1) + 4(P-2), or 6P-10, prediction vector comparisons are required each iteration, instead of the P(P-1) that would be required without the incremental updates. For the domains we have investigated, the time required to do these updates is just a small fraction of the time required to evaluate the new individuals.

As mentioned earlier, new individuals are selected with a probability proportional to their shared fitness, where the relative value of precision and recall depends on the value of β for that iteration. The replacement strategy also uses shared fitness, but in this case individuals are chosen inversely proportional to their shared fitness. Furthermore, for replacement, the fitness component is computed by averaging together the F-measure of equation 1 with β values of 0, ½, and 1, so that patterns that perform poorly on both precision and recall are most likely to be deleted.
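A minimal sketch of the phenotypic distance, the niche count of equation (2) with the exponent set to 3, and shared fitness, using Boolean prediction vectors; the naive full recomputation is shown rather than the incremental update described above.

ALPHA = 3  # exponent chosen from experiments on synthetic data

def distance(u, v):
    # fraction of target events on which the two prediction vectors differ
    return sum(a != b for a, b in zip(u, v)) / len(u)

def niche_count(i, vectors):
    # equation (2): sum over the whole population, including i itself
    return sum((1.0 - distance(vectors[i], v)) ** ALPHA for v in vectors)

def shared_fitness(raw_fitness, nc):
    return raw_fitness / nc

# identical vectors -> niche count equals the population size
vecs = [[1, 0, 1, 0]] * 5
print(niche_count(0, vecs))  # 5.0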

5 CREATING PREDICTION RULES

This section describes an efficient algorithm for ordering the prediction patterns returned by the genetic algorithm and pruning redundant patterns. The algorithm utilizes the precision, recall, and prediction vector returned by the GA for each pattern. The algorithm for forming a solution, S, from a set of candidate patterns, C, is shown below:

1.  C = patterns returned from the GA; S = {};
2.  while C ≠ {} do
3.      for c ∈ C do
4.          if (increase_recall(S+c, S) ≤ THRESHOLD)
5.          then C = C - c;
6.          else c.score = PF × (c.precision - S.precision) +
7.                         increase_recall(S+c, S);
8.      done
9.      best = {c ∈ C | ∀x ∈ C, c.score ≥ x.score}
10.     S = S || best; C = C - best;
11.     recompute S.precision on training set;
12. done

This algorithm incrementally builds solutions with increasing recall by heuristically selecting the best prediction pattern remaining in the set of candidate patterns, using the formula on lines 6 and 7 as an evaluation function. Prediction patterns that do not increase the recall of the solution by at least THRESHOLD are discarded, in order to prevent overfitting of the data. This step will also remove redundant patterns that do not predict any target events not already predicted. The evaluation function rewards those candidate patterns that have high precision and predict many of the target events not already predicted by S. The Prediction Factor (PF) controls the relative importance of precision and recall, and increasing it will reduce the complexity of the learned prediction rule. Experimental results indicate that selecting patterns solely based on their precision yields results only slightly worse than those produced using the current evaluation function, where PF is set to 10.

This algorithm is quite efficient: if p candidate patterns are returned by the GA (i.e., p is the population size), then the algorithm requires O(p²) computations of the evaluation function and O(p) evaluations on the training data (step 11). If we assume that there is much more training data than individuals in the population and that

target events are rare (this affects the time to compute the evaluation function), then the running time of the algorithm is bounded by the time to evaluate the patterns on the training data, which is O(ps), where s is the size of the training set. In practice, far fewer than p iterations of the for loop will be necessary, since the majority of the prediction patterns will not pass the test on line 4.
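A Python sketch of this greedy loop under representation assumptions of our own: each candidate carries a precision value and the set of target-event ids it predicts, and the recomputation of S.precision on the training set (step 11) is delegated to a caller-supplied stub named recompute_precision.

def build_solution(candidates, total_targets, recompute_precision,
                   threshold=0.01, pf=10):
    # Order patterns and prune redundant ones (Section 5 pseudocode).
    # candidates: list of objects with .precision and .predicted (a set).
    solution, covered, s_precision = [], set(), 0.0
    while candidates:
        scored = []
        for c in list(candidates):
            gain = len(c.predicted - covered) / total_targets
            if gain <= threshold:
                candidates.remove(c)   # line 5: prune low-gain patterns
            else:
                scored.append((pf * (c.precision - s_precision) + gain, c))
        if not scored:
            break
        best = max(scored, key=lambda sc: sc[0])[1]   # line 9
        solution.append(best)
        covered |= best.predicted
        candidates.remove(best)
        s_precision = recompute_precision(solution)   # line 11 (stub)
    return solution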

6 RESULTS

This section describes the results of applying Timeweaver to the problem of predicting telecommunication equipment failures and predicting the next command in a sequence of UNIX commands. The default values of 1% for THRESHOLD and 10 for PF are used for all experiments. All results are based on evaluation on an independent test set and, unless otherwise noted, on 2000 iterations of the GA. Precision is measured using reduced precision for the equipment failure problem, except in Figure 5, where simple precision is used to allow comparison with other approaches. For the UNIX command prediction problem, reduced and simple precision are identical, since the monitoring period is equal to 1.

6.1 PREDICTING EQUIPMENT FAILURES

The problem is to predict telecommunication equipment failures from historical alarm data. The data contains 250,000 alarms reported from 75 4ESS switches, of which 1200 of the alarms indicate distinct equipment failures. Except when specified otherwise, all experiments have a 20-second warning time and an 8-hour monitoring time. The alarm data was broken up into a training set with 75% of the alarms and a test set with 25% of the alarms (each data set contains alarms from different 4ESS switches). Figure 3 shows the performance of the learned prediction rules, generated at different points during the execution of the GA. The curve labeled Best 2000 was generated by combining the best prediction patterns from the first 2000 iterations. The figure shows that the performance improves with time. Improvements were not found after iteration 2000.

[Figure 3: Learning to Predict Equipment Failures. Precision (y-axis, 0-100%) versus recall (x-axis, 0-60%) curves for the rules at iteration 0, 500, and 1000, and for the Best 2000 set; the first data point of the Best 2000 curve lies at (4.4, 90.0).]

These results are notable: the baseline strategy of predicting a failure every warning time (20 seconds) yields a precision of 3% and a recall of 63%. The curves generated by Timeweaver all converge to this value, since Timeweaver essentially mimics this baseline strategy to maximize recall. A recall greater than 63% is never achieved, since 37% of the failures have no events in their prediction period. The prediction pattern corresponding to the first data point for the Best 2000 curve in Figure 3 is: 351:*

class distribution. As a result, neither C4.5rules nor RIPPER predicts any failures when their default parameters are used. To compensate for the skewed distribution, various values of misclassification cost (i.e., the relative cost of false negatives to false positives) were tried, and only the best results are shown in Figure 5. Note that in Figure 5 the number after the w indicates the window size and the number after the m the misclassification cost.

FOIL is a more natural learning system for event prediction problems, since it can represent sequence information using relations such as successor(E1, E2) and after(E1, E2), and therefore does not require any significant transformation of the data. FOIL provides no way for the user to modify the misclassification cost, so the default value of 1 is used.

[Figure 5: Comparison with Other ML Methods. Precision (0-100%) versus recall (0-8%) for timeweaver, c4.5 (w3 m5), ripper (w3 m20), and FOIL.]

C4.5rules required 10 hours to run for a window size of 3. RIPPER was significantly faster and could handle a window size up to 4; however, peak performance was achieved with a window size of 3. FOIL produced results that were generally inferior to the other methods. All three learning systems achieved only low levels of recall (note the limited range on the x-axis). For C4.5rules and RIPPER, increasing the misclassification cost beyond the values shown caused a single rule to be generated: a rule that always predicted the target event. Timeweaver produces significantly better results than these other learning methods and also achieves higher levels of recall.

Timeweaver can also be compared against ANSWER, the expert system responsible for handling the 4ESS alarms (Weiss, Ros & Singhal, 1998). ANSWER uses a simple thresholding strategy to generate an alert whenever more than a specified number of interrupt alarms occur within a specified time period. These alerts can be interpreted as a prediction that the device generating the alarms is going to fail. Various thresholding strategies were tried, and the thresholds generating the best results are shown in Figure 6. Each data point represents a thresholding strategy. Note that increasing the number of interrupts required to hit the threshold decreases the recall and tends to increase the precision. By comparing these results with those of Timeweaver in Figure 3, one can see that Timeweaver yields superior results, with a precision often 3-5 times higher for a given recall value. Much of this improvement

is undoubtedly due to the fact that Timeweaver's pattern language is much more expressive than a simple thresholding language.

[Figure 6: Using Thresholds to Predict Failures. Precision (6-17%) versus recall (0-50%) for thresholding strategies with threshold durations of 4 hours and 1 day; the plotted strategies range from 1 interrupt in 4 hours or in 1 day up to 7 in 4 hours and 9 in 1 day.]

6.3 PREDICTING UNIX COMMANDS

In order to demonstrate the effectiveness of our system across multiple domains, we applied our system to the task of predicting whether the next UNIX command in a log of commands is the target command. The time at which each command was executed is not available, so this is a sequence prediction problem (thus the warning and monitoring times are both set to 1). Note that Timeweaver can solve sequence prediction tasks because they may be considered a special case of an event prediction task. The dataset contains 34,490 UNIX commands from a single user. Figure 7 shows the results for 4 target commands. Timeweaver does better, and except for the more command much better, than a non-incremental version of IPAM, a probabilistic method that predicts the most likely next command based on the previous command (Davison & Hirsh, 1998). The results from IPAM are shown as individual data points. The first prediction pattern in the prediction strategy generated by Timeweaver to predict the ls command is the pattern 6:cd.?.cd.?.cd (the pattern duration of 6 means the match must occur within 6 events). This pattern matches the sequence cd ls cd ls cd, which is likely to be followed by another ls command.
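As an illustrative check, the sketch matcher from Section 4.1 accepts this sequence for the pattern's event part (the duration bound of 6 is not modeled by that sketch):

# cd.?.cd.?.cd as parsed groups: five single-event groups joined by '.'
groups = [["cd"], ["?"], ["cd"], ["?"], ["cd"]]
connectors = [".", ".", ".", "."]
print(pattern_matches(groups, connectors,
                      ["cd", "ls", "cd", "ls", "cd"]))  # True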

[Figure 7: Predicting UNIX Commands. Precision (0-70%) versus recall (0-100%) curves for the target commands cd, more, ls, and w, with the corresponding IPAM results shown as individual labeled points.]

7 RELATED RESEARCH

Our system performs prediction, which may be viewed as a type of classification task, and hence our system shares much in common with classifier systems (Goldberg, 1989). However, due to the simplicity of our rules, we feel our work has more in common with other genetic-based systems that use similarly simple rules to classify examples (Greene & Smith, 1993; De Jong, Spears & Gordon, 1993; McCallum & Spackman, 1990), as well as evolutionary methods that build decision trees to classify examples (Marmelstein & Lamont, 1998; Ryan & Rayward-Smith, 1998).

Transforming the event sequence data into an unordered set of examples permits existing concept-learning programs to be used (Dietterich & Michalski, 1985). These programs handle categorical features, and we used this technique in Section 6 so that C4.5 and RIPPER, two popular machine learning programs, could be applied to the event prediction problem. This approach has also been used within the telecommunication industry to identify recurring transient network faults (Sasisekharan, Seshadri & Weiss, 1996) and to predict catastrophic equipment failures (Weiss, Eddy & Weiss, 1998). The problem with these techniques is that the transformation process will lose some sequence and temporal information, and one does not know a priori what information is useful.

8 CONCLUSION

This paper investigated the problem of predicting rare events with categorical features from event sequence data. We showed how the rare event prediction problem could be formulated as a machine learning problem and how Timeweaver, a genetic-based machine learning system, could solve this class of problems by identifying predictive temporal and sequential patterns directly in the unmodified event sequence data. This approach was compared to other machine learning approaches and shown to outperform them.

Acknowledgments

Thanks to members of the Rutgers machine learning research group, and especially Haym Hirsh, for feedback on this work.

References

Brockwell, P. J., and Davis, R. 1996. Introduction to Time-Series and Forecasting. Springer-Verlag.

Cohen, W. 1995. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, 115-123.

Davison, B., and Hirsh, H. 1998. Probabilistic Online Action Prediction. In Proceedings of the AAAI Spring Symposium on Intelligent Environments.

De Jong, K. A., Spears, W. M., and Gordon, D. 1993. Using Genetic Algorithms for Concept Learning. Machine Learning, 13:161-188.

Dietterich, T., and Michalski, R. 1985. Discovering Patterns in Sequences of Events. Artificial Intelligence, 25:187-232.

Giordana, A., Saitta, L., and Zini, F. 1994. Learning Disjunctive Concepts by Means of Genetic Algorithms. In Proceedings of the Eleventh International Conference on Machine Learning, 96-104.

Goldberg, D. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.

Greene, D. P., and Smith, S. F. 1993. Competition-Based Induction of Decision Models from Examples. Machine Learning, 13:229-257.

Marmelstein, R., and Lamont, G. 1998. Pattern Classification using a Hybrid Genetic Program-Decision Tree Approach. In Proceedings of the Third Annual Genetic Programming Conference, 223-231.

McCallum, R., and Spackman, K. 1990. Using Genetic Algorithms to Learn Disjunctive Rules from Examples. In Proceedings of the Seventh International Conference on Machine Learning, 149-152.

Neri, F., and Saitta, L. 1996. An Analysis of the Universal Suffrage Selection Operator. Evolutionary Computation, 4(1):87-107.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Quinlan, J. R. 1990. Learning Logical Definitions from Relations. Machine Learning, 5:239-266.

Ryan, M. D., and Rayward-Smith, V. J. 1998. The Evolution of Decision Trees. In Proceedings of the Third Annual Genetic Programming Conference, 350-358.

Sasisekharan, R., Seshadri, V., and Weiss, S. 1996. Data Mining and Forecasting in Large-Scale Telecommunication Networks. IEEE Expert, 11(1):37-43.

Syswerda, G. 1990. A Study of Reproduction in Generational and Steady-State Genetic Algorithms. In the First Workshop on the Foundations of Genetic Algorithms and Classification Systems, Morgan Kaufmann.

Van Rijsbergen, C. J. 1979. Information Retrieval, second edition. London: Butterworth.

Weiss, G. M., Eddy, J., Weiss, S., and Dube, R. 1998. Intelligent Technologies for Telecommunications. In Intelligent Engineering Applications, Chapter 8, CRC Press, 249-275.

Weiss, G. M., Ros, J. P., and Singhal, A. 1998. ANSWER: Network Monitoring Using Object-Oriented Rules. In Proceedings of the Tenth Conference on Innovative Applications of Artificial Intelligence, Madison, Wisconsin.

Weiss, G. M., and Hirsh, H. 1998. Learning to Predict Rare Events in Event Sequences. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, 359-363.