
Sparse Sequence Modeling with Applications to Computational Biology and Intrusion Detection

Eleazar Eskin

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2002

© 2002

Eleazar Eskin

All Rights Reserved

ABSTRACT

Sparse Sequence Modeling with Applications to Computational Biology and Intrusion Detection

Eleazar Eskin

Sequence models have been studied for some time in different contexts including language parsing and analysis, genomics, and recently in computer security in the area of intrusion detection. Many of these sequences can be characterized as "sparse", that is, only a fraction of the elements of the sequence have meaningful value. This is the case in many practical applications, such as the analysis of DNA sequences, where it is postulated that only about 1-3% of the sequence has any biological significance. Similarly, in intrusion detection, the evidence that an audit stream from a system contains an attack is typically buried in a vast amount of irrelevant information. Modeling sparse sequences often requires allowing "softer" matches between a sequence and a canonical model, such as allowing for mismatches. For example, the classical DNA signal "TATAAT" can often occur with several mismatches in any position, such as the occurrences "TATCAT" or "TAAAAT". Computationally, this is problematic because there is an exponential number of models which can match a given sequence. Thus naive approaches to sparse sequence modeling are computationally complex in both time and space.

We present a new efficient framework for approaching sparse sequence modeling problems. We present techniques using this framework to address three computational problems: classification or transduction, outlier detection, and signal finding. Specifically, we demonstrate this framework with three applications: classification of amino acid sequences into protein families, outlier detection over sequences of system calls for intrusion detection, and signal finding for discovering transcription factor binding sites in gene promoter regions. This framework employs efficient data structures which index the sequences to allow iterating over all possible sparse models of the sequence. We modify several learning algorithms, including Boosting, Support Vector Machines, and a new set of outlier detection algorithms, to take advantage of these data structures. While still considering as rich a set of models as the naive approaches, we avoid their intractable time and space requirements.

Contents

List of Tables

List of Figures

Acknowledgements

I Introduction

Chapter 1 Introduction
1.1 Introduction
1.2 Sparse Sequence Problems
    1.2.1 Protein Family Classification
    1.2.2 System Call Anomaly Detection
    1.2.3 Motif Finding in DNA Sequences
1.3 A Sparse Sequence Modeling Framework
    1.3.1 Sparse Models
        1.3.1.1 Prefix and Suffix Models
        1.3.1.2 Wild-card Models
        1.3.1.3 Mismatch Models
        1.3.1.4 Trigger Models
        1.3.1.5 Feature Spaces: Spectrums vs. Subsequences
    1.3.2 Data Structures
        1.3.2.1 Mismatch Tree Data Structure
        1.3.2.2 SMT Data Structures
    1.3.3 Classification/Transduction of Sparse Sequences
    1.3.4 Outlier Detection for Sparse Sequences
    1.3.5 Signal Finding in Sparse Sequences
1.4 Contributions of this Thesis
1.5 Organization of Thesis

II Classification of Sparse Sequences

Chapter 2 Large Margin Prediction Trees
2.1 Motivation
2.2 Preliminaries
2.3 Prediction
2.4 Learning with Boosting
    2.4.1 Original Algorithm
    2.4.2 Efficient Extension
    2.4.3 Context Priors for Boosting
2.5 Learning with SVMs
    2.5.1 SVMs over Sparse Sequences
2.6 Efficient Data Structures
    2.6.1 Mismatch Tree Data Structure
    2.6.2 Advantages of the Mismatch Tree Approach
    2.6.3 Example of Mismatch Tree Data Structure
    2.6.4 Other Sparse Model Data Structures
2.7 Experiments: Protein Classification

Chapter 3 Sparse Markov Transducers
3.1 Motivation
3.2 Comparison to Probabilistic Suffix Trees
3.3 Sparse Markov Transducers
    3.3.1 Sparse Prediction Trees
    3.3.2 Training a Prediction Tree
3.4 Sparse Markov Chains as Sparse Prediction Trees
3.5 Mixture of Sparse Prediction Trees
3.6 Prior Distribution of Sparse Prediction Trees
3.7 Weight Update Algorithm
3.8 Proof of Claim 1
3.9 Implementation Issues
3.10 Methodology
3.11 Pfam Experiments
3.12 SCOP Experiments
3.13 Efficiency Experiments
3.14 Discussion

Chapter 4 Mixture of Common Ancestors
4.1 Motivation
4.2 Preliminaries
4.3 The Common Ancestor Model
4.4 Mixtures of Common Ancestors
4.5 Incorporating Default Models
4.6 Using Biological Information to Set the Priors
4.7 Estimating the Probabilities of Amino Acids
4.8 Experiments with Protein Families
4.9 Computing the Prediction of a Single Ancestor Model
4.10 Batch Computation of the Mixture Weights
4.11 Discussion

III Outlier Detection over Sparse Sequences

Chapter 5 A Geometric Framework for Outlier Detection
5.1 Motivation
5.2 Unsupervised Anomaly Detection
5.3 A Geometric Framework for Unsupervised Anomaly Detection
    5.3.1 Feature Spaces
    5.3.2 Kernel Functions
    5.3.3 Convolution Kernels
5.4 Detecting Outliers in Feature Spaces
5.5 Algorithm 1: Cluster-based Estimation
5.6 Algorithm 2: K-Nearest Neighbor
5.7 Algorithm 3: One Class SVM
5.8 Feature Spaces for Intrusion Detection
    5.8.1 Data-dependent Normalization Kernels
    5.8.2 Kernels for Sequences: The Spectrum Kernel
5.9 Experiments
    5.9.1 Performance Measures
    5.9.2 Data Set Descriptions
    5.9.3 Experimental Setup
    5.9.4 Experimental Results
5.10 Discussion

Chapter 6 Dynamic Window Sizes for System Calls
6.1 Motivation
6.2 Program Call Graphs
6.3 Sparse Markov Transducers
6.4 Experiments over Audit Data
    6.4.1 Baseline Comparison Methods
    6.4.2 Experimental Results
6.5 Discussion

IV Signal Finding in Sparse Sequences

Chapter 7 Mismatch Tree Approach to Dyad Signal Finding
7.1 Motivation
7.2 Monad Pattern Discovery
7.3 Mismatch Tree Algorithm
    7.3.1 Splitting Pattern Space
    7.3.2 Mismatch Tree Data Structure
    7.3.3 Incorporating Pairwise Similarity into the Sample Driven Approach
    7.3.4 Improvements over the WINNOWER
7.4 Discovering Dyad Signals
7.5 Tests
    7.5.1 Scoring Patterns
    7.5.2 Simulated Data
    7.5.3 Monad Motifs in DNA Sequences
    7.5.4 Composite Motifs in DNA Sequences
    7.5.5 Discussion

Chapter 8 Genome Wide Analysis of Regulatory Regions in Bacteria
8.1 Motivation
8.2 Sample Generation
8.3 Finding Signals
8.4 Scoring Signals
8.5 Finding Putative Regulatory Elements in Bacterial Genomes
8.6 Discussion

V Conclusion

Chapter 9 Conclusion
9.1 Overview
9.2 Summary of Theoretical Results
    9.2.1 Machine Learning Results
    9.2.2 Outlier Detection Results
    9.2.3 Signal Finding Results
9.3 Summary of Application Results
    9.3.1 Protein Family Classification
    9.3.2 Audit Stream Analysis
    9.3.3 Motif-Finding
9.4 Future Work
    9.4.1 General Framework Future Directions
    9.4.2 Classification of Sparse Sequence Future Directions
    9.4.3 Outlier Detection of Sparse Sequences
    9.4.4 Motif Finding
    9.4.5 Applications

List of Tables

3.1 Number of nodes after a single training example without efficient data structures. The number of nodes generated per example increases exponentially with the maximum number of wild-cards.

3.2 Time-space-performance tradeoffs for the SMT family model trained on the ABC transporters family, which contained a total of 330 sequences. Time is measured in seconds and space is measured in megabytes. The normal and efficient columns refer to the use of the efficient sequence-based data structures. Because of memory limitations, without the efficient data structures many of the models with high parameter values were impossible to compute (indicated with –).

3.3 Results of protein classification using SMTs. The equivalence scores are shown for each model for the first 50 families in the database. The SMT prediction and SMT classifier models were built with fixed settings of the maximum context length and maximum number of wild-cards. A two-tailed signed rank test assigns a p-value to the null hypothesis that the means of the two classifiers are equal. The best performing model is the SMT Classifier, followed by the SMT Prediction model, followed by the PST Prediction model. The signed rank test p-values between the classifiers are all significant.

3.4 Results of protein classification using SMTs (continued).

3.5 Results of protein classification using SMTs (continued).

3.6 Performance of the SMT compared to BLAST, HMMER, and the Fisher kernel. The table is sorted with respect to the ratio of positive to negative examples in the training set. As expected, for extremely skewed ratios the SMT performance degrades significantly.

4.1 Excess entropy for different probability estimation methods under different sample sizes over the BLOCKS database. The encoding costs were computed over the BLOCKS database for CA-Mixture, Zero-1, Zero-0.0481, and Pseudo, and were consistent with previously published results. The encoding costs reported for Dirichlet-S and Dirichlet-K are published results [72]. Note that the Dirichlet mixtures were obtained with a parameter search in order to minimize entropy over the data set, while the mixture of common ancestors results are without parameter optimization.

5.1 Features in KDD Cup network connection records. For details on the features in the data set see [81].

5.2 Lincoln Labs Data Summary.

5.3 Selected points from the ROC curves of the performance of each algorithm over the KDD Cup 1999 data.

6.1 Lincoln Labs Data Summary.

6.2 University of New Mexico Data Summary.

7.1 The performance of PDA, SDA, MITRA-Count and MITRA-Graph on synthetic data. CPU time is given in minutes and memory usage is given in megabytes. In all experiments the signal occurs in all of the sequences. Blank entries "–" or entries in italics denote the inability of the algorithm to solve the challenge problem because of a lack of memory or CPU resources; the italic entries are estimates of the resources necessary to solve the problem. The last column gives the percentage of the possible edges that exist in the graph at the root node of MITRA-Graph. Note that the number of edges is a better indicator for determining the memory and CPU usage than the parameters l and d. All experiments were performed on a machine with a Pentium III 750 MHz processor and 1 GB of RAM. In the cases when PDA takes too much time to complete the search, we estimated its running time based on timing a portion of its search. We were not able to run the SDA search for patterns of length 14 or greater because of memory considerations. Even at this point the implementation was tricky because we compress a string of length 13 into a 26-bit integer; at length 14 this breaks down, and the memory requirements and running time increase even further.

7.2 The performance of MITRA for biological samples with monad motifs from [23]. For each sample, the prediction of MITRA is shown as well as the (l,d)-k parameters of the predicted signal. The nucleotides in the predicted patterns that match the actual binding site are in bold. In some cases there were several strong motifs (in addition to the one discussed in [23]), the number of which is listed; the shown motif is the one closest to the biological sample. References: (A) preproinsulin promoter region motif [132]. (B) DHFR non-TATA transcription start signal [91]. (C) MREa promoter [5]. (D) c-fos serum response element [94]. (E) yeast early cell cycle box [90].

7.3 Dyad signals from P. horikoshii [46]. The second to last row shows the pattern predicted by MITRA-Dyad. The last row shows the consensus pattern, which is generated by choosing the most common nucleotide from the instances of the pattern at each position.

7.4 URS1 and UASH motifs from [51]. Binding sites and positions are shown for the yeast gene upstream regions where the two components of the composite pattern occur within 50 bases. All positions are relative to the annotated translation start sites of the respective genes. Distances between binding sites are given. A prediction that overlaps with the actual site is considered correct. Six sequences (not shown) were not analyzed because the URS1 site and UASH site are more than 50 bases apart. The last row shows the top scoring pattern of MITRA-Dyad. The top 3 ranked patterns were minor variants of the shown pattern.

8.1 Genome Intergenic Region Statistics. The first column is the name of the genome; the ID is an abbreviation for the genome. The next columns list the number and the lengths of intergenic regions for divergent and convergent samples. The last column describes the genome taxonomy.

8.2 Top scoring dyad signals in 20 bacterial genomes. Underlined signals are particularly strong (high strength and dyad scores). The Signal Class column labels whether the signal falls into a known biological signal. Classes are defined as (PB) Gilbert-Pribnow signal, (PB*) variant Gilbert-Pribnow signal, (CRP) CRP signal, (CRP*) variant CRP signal, (PU) possible palindromic signal for an unknown factor.

List of Figures

1.1 A mismatch tree for the sequence AGTATCAGTT corresponding to a search for (8,1) motifs. The paths from the root to the nodes define the labels of the nodes. The leaf nodes contain (i) the number of mismatches between the prefix of a substring in the sequence and the label of the node and (ii) a pointer to the tail of the instance. (a) The tree in its initial state. (b) The tree after expanding the path A. (c) The tree after expanding the path AG. Notice that many of the instances reach the maximum number of allowed mismatches and are not passed down to their children nodes. For clarity, not all of the pointers are drawn in (c).

1.2 Efficient data structures for SMTs. The boxes represent input sequences with their corresponding output. (a) The tree after five input sequences have been added. (b) The tree after a sixth input sequence is added. Note that a node has been expanded because of the addition of the input.

2.1 An (8,1)-mismatch tree for the sequence AVLALKAVLL used for computing the kernel matrix with k-mers of length 8 allowing 1 mismatch. The path from the root to a node is the "prefix" of the patterns that correspond to the node. The leaf nodes contain (i) the number of mismatches between the prefix of a substring in the sequence and the prefix of the corresponding patterns and (ii) a pointer to the tail of the substring. In addition, the leaf nodes contain information for generating the kernel matrix, such as the source of the instance and its weight; this additional information is not shown in the example for clarity. (a) The tree in its initial state containing just the root node. (b) The tree after expanding the path A. (c) The tree after expanding the path AV. Notice that many of the instances reach the maximum number of allowed mismatches and are not passed down to their children nodes, as shown in (c).

2.2 Comparison of four homology detection methods. The graph plots the total number of families for which a given method exceeds an ROC50 score threshold. Each series corresponds to one of the homology detection methods described in the text.

2.3 Family-by-family comparison of the (5,1)-mismatch-SVM with the Fisher-SVM. The coordinates of each point in the plot are the ROC50 scores for one SCOP family, obtained using the mismatch-spectrum-SVM with k = 5, m = 1 (x-axis) and the Fisher-SVM (y-axis). The dotted line is y = x.

2.4 Family-by-family comparison of the spectrum SVM with and without mismatches. The coordinates of each point in the plot are the ROC50 scores for one SCOP family, obtained using the spectrum SVM with k = 3 (x-axis) and the mismatch-spectrum SVM with k = 5, m = 1 (y-axis). The dotted line is y = x.

3.1 An illustration of a sparse prediction tree. For space considerations we do not draw branches for all 20 amino acids.

3.2 A sparse prediction tree derived from a sparse Markov chain.

3.3 A sparse prediction tree with its generative probabilities.

3.4 An illustration of a mixture of sparse suffix trees for fixed settings of the depth and wild-card parameters. In order to simplify the figure we assume that the input alphabet consists of the three symbols {A, C, D}.

3.5 The template tree for a mixture of sparse prediction trees for fixed settings of the depth and wild-card parameters. For space considerations we do not draw branches for all 20 amino acids.

3.6 The template tree of Figure 3.5 after processing the input sequences AA, AC and CD.

3.7 Efficient data structures for SMTs. The boxes represent input sequences with their corresponding output. (a) The tree after five input sequences have been added. (b) The tree after a sixth input sequence is added. Note that a node has been expanded because of the addition of the input.

3.8 Scatter plots of equivalence scores from three models of protein classification. (a) SMT Classifier vs. PST. (b) SMT Classifier vs. SMT Generative. (c) SMT Generative vs. PST.

4.1 A mutation matrix derived from a BLOSUM50 matrix. Notice that all of the diagonal elements contain 0's.

4.2 Comparison of Mixture of Common Ancestors and Pseudo-Counts applied to protein family classification. For each method, we compute the accuracy for each protein family.

6.1 A sample call graph and execution trace. Note the execution path is marked on the graph in bold. The system call trace of the execution path is the set of system calls along the edges of the graph.

6.2 A portion of a call graph corresponding to a single system call branch. Note the system call subsequence corresponding to this call graph is "ioctl mmap* mmap unlink".

7.1 A mismatch tree for the sequence AGTATCAGTT corresponding to a search for (8,1) motifs. The paths from the root to the nodes define the labels of the nodes. The leaf nodes contain (i) the number of mismatches between the prefix of a substring in the sequence and the label of the node and (ii) a pointer to the tail of the instance. (a) The tree in its initial state. (b) The tree after expanding the path A. (c) The tree after expanding the path AG. Notice that many of the instances reach the maximum number of allowed mismatches and are not passed down to their children nodes. For clarity, not all of the pointers are drawn in (c).

8.1 Histograms of separation distances and positions for the Pribnow-Gilbert dyad signal TTGACA-17-TATAAT. Separation distances in (a) the BS genome and (b) the EC genome; positions in (c) the BS genome and (d) the EC genome. We bin the positions of the signal instances into buckets (30 bp by default) since positions rarely match exactly and the exact transcription start position is often unknown.

Acknowledgements

There are many people who I would like to thank for their support throughout my graduate experience, but two stand out as the most important: Sal Stolfo and Yoram Singer. I am grateful for their support, friendship, and advice, for teaching me the importance of impact in my research, and for the example they set for me to follow.

I would also like to thank the many people I worked with at Columbia and elsewhere, especially Bill Noble, Pavel Pevzner, Luis Gravano, Christina Leslie, Judith Klavans, Eugene Agichtein, Shlomo Hershkop, Matt Schultz, and the many students who worked on the intrusion detection systems project, who made my time at Columbia much more fun. In addition, I would like to thank my committee of Sal Stolfo, Luis Gravano, Tony Jebara, Bill Noble, and Yoram Singer for their extensive comments on an earlier draft of this thesis.

Finally, I would like to thank my family for their support and my friends in New York, including the Columbia Water Polo team, for providing the much needed distractions necessary to complete a Ph.D. thesis.

    ELEAZAR ESKIN

COLUMBIA UNIVERSITY
April 2002


Part I

Introduction

Chapter 1

Introduction

1.1 Introduction

Modeling sequences is a fundamental problem in computer science, with tremendous applications across many fields. The approaches to sequence modeling also come from various fields, including Hidden Markov Models [76] from the speech processing community, Markov chains from statistics, and probabilistic suffix trees [107] from the machine learning community. Recently, modern biology has been revolutionized by the availability of genomic sequences for many organisms. The modeling and analysis of these biological sequences, such as the DNA sequence of an organism or the amino acid sequences of proteins, presents an unprecedented opportunity for understanding fundamental biological processes and disease [33, 101]. Similarly, the security of computer systems can be greatly enhanced by the analysis of audit data collected from these systems: by characterizing what behavior constitutes malicious use and what constitutes normal use, we can detect malicious behavior and intrusions, which can greatly enhance network security [79]. There are also a host of other potential application areas for sequence modeling, including natural language processing, databases, and time series analysis.

Many of these sequences can be characterized as "sparse", that is, only a fraction of the elements of the sequence have meaningful value. This is the case in many practical applications, such as the analysis of DNA sequences, where it is postulated that only about 1-3% of the sequence has any biological significance. Similarly, in intrusion detection, the evidence that an audit stream from a system contains an attack is typically buried in a vast amount of irrelevant information. Furthermore, modeling sparse sequences often requires allowing "softer" matches between a sequence and a canonical model, such as allowing for mismatches. For example, the classical DNA signal "TATAAT" can often occur with several mismatches in any position, such as the occurrences "TATCAT" or "TAAAAT". Computationally, this is problematic because there is an exponential number of models which can match a given sequence. Thus naive approaches to sparse sequence modeling are computationally complex in both time and space.

Different applications require different models of "sparseness." For biological sequences, because of frequent mutations, mismatch models which allow for a limited number of mismatches within a sequence are effective. For security applications, because the evidence of an attack is often buried between irrelevant symbols in a sequence, trigger models which can explicitly model this type of sequence are effective.

In this thesis, we present a common framework for modeling sparse sequences. Specifically, we focus on three different computational problems applied to sparse sequences: transduction or classification, outlier detection, and signal finding. The transduction problem maps an input sequence to an output sequence. If the output sequences are always only a single symbol, the problem reduces to a classification problem. The outlier detection problem determines which sequences are outliers relative to the remaining sequences. The signal finding problem determines the strongest "signals", or short regions, in the sequences.

The common thread through our approach to all of these problems is a general framework for modeling sparse sequence problems. This includes a general framework for modeling different types of "sparseness" and the use of a class of efficient data structures to mitigate the computational complexity of handling sparse sequences. These data structures index the data in a way that facilitates modeling the sequences.

For each of the problems, we introduce a setting appropriate for applying our framework.

1.2 Sparse Sequence Problems

Important sequence modeling problems span many subfields of computer science. In this thesis we focus on problems where the sequences are inherently sparse, meaning that only a relatively small portion of the sequence is critical for the modeling. In sparse sequence problems, these critical portions also typically do not occur in exactly the same way across multiple sequences, but occur with some variation.

In this thesis we focus on three applications to motivate our approach, each of which requires the solution to a different computational problem. We use classification techniques to address the protein homology problem. We use outlier detection techniques to detect anomalous system call traces. Finally, we use signal finding techniques to detect motifs in DNA sequences.

    1.2.1 Protein Family Classification

As databases of proteins classified into families become increasingly available, and as the number of sequenced proteins grows exponentially, techniques to automatically classify unknown proteins into families become more important. Proteins that are in the same family are often referred to as homologs, and the protein family classification problem is often referred to as protein homology detection. Many approaches have been presented for protein classification. Initially the approaches examined pairwise similarity [128, 3]. Other approaches to protein classification are based on creating profiles for protein families [48], on consensus patterns using motifs [12, 9], and on HMM-based (Hidden Markov Model) approaches [76, 34, 13].

The protein classification problem is a supervised learning problem. We are given a set of amino acid sequences corresponding to proteins, each classified into a protein family. Our input sequences are the amino acid sequences and the labels are the families. Our goal is to learn a classifier that takes as input a protein's amino acid sequence and predicts its family.

Three sample proteins are shown below:

    ID 110K_PLAKN

    FNSNMLRGSV CEEDVSLMTS IDNMIEEIDF YEKEIYKGSH SGGVIKGMDY DLEDDENDED

EMTEQMVEEV ADHITQDMID EVAHHVLDNI THDMAHMEEI VHGLSGDVTQ IKEIVQKVNV
AVEKVKHIVE TEETQKTVEP EQIEETQNTV EPEQTEETQK TVEPEQTEET QNTVEPEQIE

ETQKTVEPEQ TEEAQKTVEP EQTEETQKTV EPEQTEETQK TVEPEQTEET QKTVEPEQTE
ETQKTVEPEQ TEETQKTVEP EQTEETQKTV EPEQTEETQN TVEPEPTQET QNTVEP

ID 11S3_HELAN
MASKATLLLA FTLLFATCIA RHQQRQQQQN QCQLQNIEAL EPIEVIQAEA GVTEIWDAYD

    QQFQCAWSIL FDTGFNLVAF SCLPTSTPLF WPSSREGVIL PGCRRTYEYS QEQQFSGEGG


    RRGGGEGTFR TVIRKLENLK EGDVVAIPTG TAHWLHNDGN TELVVVFLDT QNHENQLDEN

    QRRFFLAGNP QAQAQSQQQQ QRQPRQQSPQ RQRQRQRQGQ GQNAGNIFNG FTPELIAQSF

NVDQETAQKL QGQNDQRGHI VNVGQDLQIV RPPQDRRSPR QQQEQATSPR QQQEQQQGRR
GGWSNGVEET ICSMKFKVNI DNPSQADFVN PQAGSIANLN SFKFPILEHL RLSVERGELR

PNAIQSPHWT INAHNLLYVT EGALRVQIVD NQGNSVFDNE LREGQVVVIP QNFAVIKRAN
EQGSRWVSFK TNDNAMIANL AGRVSASAAS PLTLWANRYQ LSREEAQQLK FSQRETVLFA

    PSFSRGQGIR ASR

    ID 128U_DROME

MITILEKISA IESEMARTQK NKATSAHLGL LKANVAKLRR ELISPKGGGG GTGEAGFEVA
KTGDARVGFV GFPSVGKSTL LSNLAGVYSE VAAYEFTTLT TVPGCIKYKG AKIQLLDLPG

    IIEGAKDGKG RGRQVIAVAR TCNLIFMVLD CLKPLGHKKL LEHELEGFGI RLNKKPPNIY

YKRKDKGGIN LNSMVPQSEL DTDLVKTILS EYKIHNADIT LRYDATSDDL IDVIEGNRIY
IPCIYLLNKI DQISIEELDV IYKIPHCVPI SAHHHWNFDD LLELMWEYLR LQRIYTKPKG

QLPDYNSPVV LHNERTSIED FCNKLHRSIA KEFKYALVWG SSVKHQPQKV GIEHVLNDED
VVQIVKKV

In fact, the protein classification problem is slightly more complicated because the family corresponds only to a portion of the sequence, or a domain. Proteins can have multiple domains, and thus can be in multiple families. We plan to extend our models to specifically incorporate the notion of domains, as described in the future work section of the conclusions in Chapter 9.

1.2.2 System Call Anomaly Detection

Intrusion Detection Systems (IDS) are becoming an important part of computer security systems. A major advantage of an IDS is its ability to detect new and unknown attacks by examining audit data collected from a system. Typically this detection is performed through a data mining technique called anomaly detection [31]. Anomaly detection builds models of "normal" audit data (or data containing no intrusions) and detects intrusions based on detecting deviations from this normal model.

System call traces are a common type of audit data collected for performing intrusion detection. A system call trace is the ordered sequence of system calls that a process performs during its execution. The trace for a given process can be collected using strace. System call traces are useful for detecting a user-to-root (U2R) exploit or attack. In this type of exploit, a user exploits a bug in a privileged process (a process running as root) using a buffer overflow to create a root shell. Typically, the system call trace for a program process which is being exploited is drastically different from that of the program process under normal conditions. This is because the buffer overflow and the execution of a root shell typically call a very different set of system calls than the normal execution of the program. Because of these differences, we can detect when a process is being exploited by examining the system calls.

Below is a trace for the program ps:

execve, open, mmap, open, mmap, mmap, munmap, mmap, close, open,
mmap, mmap, munmap, mmap, close, open, mmap, mmap, munmap, mmap, mmap,
close, open, mmap, close, open, mmap, mmap, munmap, mmap, close,
close, munmap, ioctl, open, mmap, close


    1.2.3 Motif Finding in DNA Sequences

Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology, with important applications in finding regulatory signals. These regulatory signals help govern the processes of activating and deactivating genes. Detecting regulatory signals can give insight into fundamental biological processes as well as help design drug targets to fight various diseases. Most current approaches to pattern discovery focus on monad patterns, which correspond to relatively short contiguous strings that appear (with some mismatches) surprisingly many times (in a statistically significant way). However, the actual regulatory signals can correspond to other types of structured patterns, such as dyad patterns, which are two monad patterns that occur at approximately a fixed distance apart.

There have been many approaches presented for the discovery of regulatory patterns, most of which focus on monad signals. Among the best performing are MEME [10], CONSENSUS [63], the Gibbs sampler [78, 95], random projections [23], combinatorial approaches [102], and MULTIPROFILER [73]. All of these approaches focus on discovering the highest scoring signals and may not be applicable in the case where each of the pair of monad signals is not statistically significant on its own. Moreover, the existing software tools for pattern discovery involve some heuristics and/or stochastic optimization procedures and do not necessarily guarantee finding all best-scoring monad signals.

DNA sequences are subject to mutations, and as a result regulatory signals typically occur with some mismatches from the "canonical" patterns. We can represent the canonical pattern as an $l$-mer (a contiguous string of length $l$). In the case when the biological signal is represented by a profile, we use the most frequent nucleotide in every position of the profile to form the canonical pattern. Although this looks like a crude approximation of the profile, we explain below that our algorithm is able to recover the profile using the canonical pattern as a "seed". We use the term pattern or signal to refer to the canonical $l$-mer. We define the $(l,d)$-neighborhood of an $l$-mer $P$ to be the set of all possible $l$-mers with up to $d$ mismatches as compared to $P$. For the alphabet of nucleotides, the size of the $(l,d)$-neighborhood of any $l$-mer is $\sum_{i=0}^{d} \binom{l}{i} 3^i$. We use the term instances of the pattern $P$ to refer to $l$-mers from the $(l,d)$-neighborhood of $P$ that are present in the sample (i.e., $l$-mers in the sample with up to $d$ mismatches from $P$). Given a set of patterns $\mathcal{P}$, we call an $l$-mer valid if it is an instance of a pattern $P \in \mathcal{P}$.
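
As a sanity check on this count, a short Python sketch (ours, not part of the original text) compares the closed form against brute-force enumeration of a neighborhood:

from itertools import product
from math import comb

def neighborhood_size(l, d):
    # Choose up to d positions to mutate; each mutated position can
    # take any of the 3 other nucleotides.
    return sum(comb(l, i) * 3**i for i in range(d + 1))

def brute_force_neighborhood(pattern, d):
    # Enumerate every l-mer within d mismatches of `pattern`.
    l = len(pattern)
    return {"".join(w) for w in product("ACGT", repeat=l)
            if sum(a != b for a, b in zip(w, pattern)) <= d}

assert neighborhood_size(6, 2) == len(brute_force_neighborhood("TATAAT", 2))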

We define the pattern discovery problem as follows. Given a sample $S$, we want to find all $l$-mers that occur with up to $d$ mismatches at least $k$ times in the sample. Such $l$-mers are called $(l,d)$-$k$ patterns. A variant of the problem assumes that the sample $S$ is split into several sequences and we want to find all $l$-mers that occur with up to $d$ mismatches in at least $k$ sequences in the sample.

Similarly, we can define a variant of the problem for dyad signals. In this case, we are interested in discovering two monads that occur within a certain range of distances of each other. We use the notation $l_1$-$(s_1,s_2)$-$(l_2,d)$-$k$ pattern to denote a dyad signal which consists of two monads (a pattern of length $l_1$ and a pattern of length $l_2$) separated by at least $s_1$ and at most $s_2$ nucleotides, and which occurs $k$ times in the sample.

Below are three DNA sequences upstream from genes from the E. coli genome:

>Genome: Escherichia coli gene1: ECISEL1 gene2: REC04295 dist 254
ACCTCAATGTGTATCACAATATCCATATTCTTTGTGGGGGAGTCTGGAGATTGAGTAGAT

    ATTCTTGTTCAGAATGTATCAGCCGATGGTTCTACGATTCTTAAGCCACGAAGAGTTCAG

ATAGTACAACGGCATGTCTCTTTTGACTATCTGGCAACCGGCAGTGTGTTCTCTCACGCA
TCACAAAAGCAGCAGGCATAAAAAAACCCGCTTGCGCGGGCTTTTTCACAAAGCTTCAGC

    AAATTGGCGATTA


    >Genome: Escherichia coli gene1: REC00034 gene2: REC04296 dist 92

    TAATTTTATCTCGTTGATACCGGGCGTCCTGCTTGCCAGATGCGATGTTGTAGCATCTTA

    TCCAGCAACCAGGTCGCATCCGGCAAGATCA

>Genome: Escherichia coli gene1: REC00048 gene2: REC04302 dist 84
TAATTTTGTATAGAATTTACGGCTAGCGCCGGATGCGACGCCGGTCGCGTCTTATCCGGC

    CTTCCTATATCAGGCTGTGTTTA

An example of a signal is the classical Pribnow-Gilbert box TTGACA-(16,18)-TATAAT. Typically this signal occurs with several mismatches from the canonical signal.

1.3 A Sparse Sequence Modeling Framework

In this thesis, we present a general framework for modeling sparse sequences. This framework consists of two parts. First, we define a set of sparse models according to the specific model of "sparseness" that we employ. Second, we define a set of data structures that index the data by the sparse models.

1.3.1 Sparse Models

Fundamental to our approach to modeling sparse sequences is our notion of "sparse" models. We refer to the data as observed sequences or instances. We refer to the space of corresponding sparse models as the feature space. Sparse models are the possible models that explain the observed sequences. In our earlier example, given the observations of the DNA sequences TATCAT and TAAAAT, if we allow 1 mismatch in the model, a possible model that can "explain" these two observed sequences is TATAAT. We use the term correspond to describe the relation between an observed instance of a sequence and a sparse model that can "explain" this observation.

There are many different kinds of sparse models, including wild-card models, mismatch models, trigger models, suffix models, etc. Each of these models can be viewed as a predicate function over sequences: any sequence that fits the model has a non-zero value for the predicate, and all other sequences have a zero value.

    1.3.1.1 Prefix and Suffix Models

A suffix model (and equivalently a prefix model) corresponds to the sequences that contain that suffix. For example, the instance ABCDEFG corresponds to the suffix models ∅, G, FG, EFG, etc., where ∅ represents the null context. Similarly, this instance corresponds to the prefix models ∅, A, AB, ABC, etc.

Suffix models are typically used for prediction of the next symbol. For example, Probabilistic Suffix Tree (PST) [107] models effectively choose which suffix model to use based on the context.
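
Enumerating these models is straightforward; a minimal Python sketch (illustrative only) lists the suffix models of the instance above:

def suffix_models(sequence):
    # The null context plus every suffix of the sequence.
    return [sequence[i:] for i in range(len(sequence), -1, -1)]

print(suffix_models("ABCDEFG"))
# ['', 'G', 'FG', 'EFG', 'DEFG', 'CDEFG', 'BCDEFG', 'ABCDEFG']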

For a sequence of length $l$, there are typically $l + 1$ corresponding prefix or suffix models.

1.3.1.2 Wild-card Models

Wild-card models allow for wild-card symbols in the sparse models. These symbols can match an arbitrary symbol in the observed sequences. Typically we consider wild-card models of a certain length. Given an observation sequence ABCDEFG, a possible wild-card model that corresponds to the observation is φBφDEFφ. We can represent suffix (and prefix) models as wild-card models; for example, the suffix model EFG can be represented as φφφφEFG.

Wild-card models are used in Sparse Markov Transducers (SMTs), which are applied to protein family classification in Chapter 3.

A problem with wild-card models (and most sparse models) is the exponential number of models that correspond to an occurrence. Since wild-cards can be placed in any position, for a sequence of length $l$ there are $2^l$ corresponding models. Even with relatively small $l$, this can become an intractably large number of models.
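
A small sketch (illustrative; we write the wild-card as φ) makes the $2^l$ blow-up concrete:

from itertools import product

def wildcard_models(sequence, wildcard="φ"):
    # Each position is either kept or replaced by the wild-card,
    # giving 2**l models for a sequence of length l.
    return ["".join(wildcard if w else c for c, w in zip(sequence, mask))
            for mask in product([False, True], repeat=len(sequence))]

models = wildcard_models("ABCDEFG")
print(len(models))            # 128 == 2**7
print(models[1], models[-2])  # ABCDEFφ φφφφφφG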

    1.3.1.3 Mismatch Models

Mismatch models can be viewed as canonical sequences which tolerate a certain number of mismatches in their occurrences. Consider the model ABCDEFG which allows for 1 mismatch. Valid occurrences of this model are DBCDEFG, ABCDEGG, ABCCEFG, etc.

Mismatch models are effective in modeling DNA sequences because of the frequency of mutations in these sequences. Mismatch models form the core of MITRA, the method for discovering transcription factor binding sites in Chapters 7 and 8.

Observed sequences correspond to an exponential number of mismatch models. For example, consider mismatch models of length $l$ that allow $d$ mismatches. Each sequence of length $l$ corresponds to $\sum_{i=0}^{d} \binom{l}{i} (|\Sigma| - 1)^i$ models, where $|\Sigma|$ is the size of the alphabet, or the number of possible symbols. Even with relatively small values of $|\Sigma|$, $l$ and $d$, this can cause serious computational difficulties.
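
A short sketch (ours) evaluates this sum for the 20-letter amino acid alphabet to show how quickly the number of corresponding models grows:

from math import comb

def num_mismatch_models(l, d, alphabet_size):
    # Models that differ from an observed l-mer in at most d positions;
    # each differing position can take any of the other symbols.
    return sum(comb(l, i) * (alphabet_size - 1)**i for i in range(d + 1))

for d in range(4):
    print(d, num_mismatch_models(10, d, 20))
# 0 1
# 1 191
# 2 16436
# 3 839516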

1.3.1.4 Trigger Models

While the previously presented types of sparse models focus on short sequences, trigger models focus on a different kind of phenomenon. Trigger models describe a set of key events, or triggers, in a sequence, where the events can be interspersed by many other events. Consider the event B followed by D followed by F, which we denote B → D → F. An instance of this trigger model is the sequence ABCDEFG. Likewise, other instances are AABGGDFAA and BDF.

Trigger models are effective for modeling event streams. Trigger models are used in LMPTs (Large Margin Prediction Trees), described in Chapter 2. We plan to use trigger models to model audit streams in the context of intrusion detection, as outlined in Chapter 9.

For a given sequence of length $l$ and trigger models that contain $k$ triggers, there are up to $\binom{l}{k}$ possible models for each observed sequence. Since $l$ is typically larger for trigger models than for other types of models, there are many possible models that need to be considered.
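
Testing whether a sequence is an instance of a given trigger model is an ordered-subsequence check, as the following sketch (illustrative only) shows:

def matches_trigger(sequence, triggers):
    # True if the trigger events occur in order, with arbitrary
    # events interspersed between them.
    it = iter(sequence)
    return all(t in it for t in triggers)

print(matches_trigger("ABCDEFG", "BDF"))    # True
print(matches_trigger("AABGGDFAA", "BDF"))  # True
print(matches_trigger("AFDBCA", "BDF"))     # False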

1.3.1.5 Feature Spaces: Spectrums vs. Subsequences

Once we define a set of sparse models for our application, this defines our feature space. The feature space is a vector space $\mathbb{R}^N$, where $N$ is the total number of sparse models. Each coordinate of the feature space corresponds to a sparse model. Each sequence is mapped to a point in the feature space using the predicates defined by the sparse models.

For example, given the alphabet $\Sigma = \{A, B, C, D\}$, if we are using only suffix models, then the sequence ABCD would be mapped to a vector in the feature space with non-zero values for the coordinates corresponding to the sparse models ∅, D, CD, BCD, ABCD. All other coordinates of this vector would be zero.

If we were using different types of sparse models, such as mismatch models or wild-card models, there would be many more non-zero coordinates in the vector.


Our sparse models are defined for relatively short sequences of length $l$. However, the sequences we are modeling are typically much longer. There are two approaches to handling this. First, we can process each subsequence independently; this method is used in Sparse Markov Transducers (Chapter 3) and MITRA (Chapters 7 and 8). Another way to model the sequences is to simply define the vector for a sequence in the feature space to be the sum of the vectors defined by its subsequences. This vector is referred to as the spectrum of the sequence and is used in LMPTs and also in [83] and [82].
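
A minimal sketch of the spectrum map (ours; a sparse dictionary stands in for the full $|\Sigma|^l$-dimensional vector):

from collections import Counter

def spectrum(sequence, l):
    # Sum of the feature vectors of all length-l subsequences:
    # a count for every l-mer that occurs in the sequence.
    return Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))

print(spectrum("AGTATCAGTT", 3))
# Counter({'AGT': 2, 'GTA': 1, 'TAT': 1, 'ATC': 1, 'TCA': 1, 'CAG': 1, 'GTT': 1})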

    1.3.2 Data Structures

We employ a set of data structures, each tailored to a specific type of sparse model, to allow for efficient processing of the sparse models. The data structures index the data in a way which allows the quick recovery of the sequences in the data that correspond to a given sparse model. There are two major variants of data structures presented in this thesis. The first type are data structures designed for efficient traversal of the feature space; an example of this type of data structure is the mismatch tree. The second type are data structures designed for efficiently keeping state for models in the feature space; this type of data structure is useful for online prediction, an example being the Sparse Markov Transducer data structure.

1.3.2.1 Mismatch Tree Data Structure

An example of a data structure that allows for efficient traversal of the space of all models is the mismatch tree. Mismatch trees are designed for mismatch models. Let us assume that we are considering mismatch models of length $l$ where we allow $d$ mismatches, over the alphabet $\Sigma = \{A, C, G, T\}$. A complete description of mismatch trees is given in Chapter 7.

Mismatch trees are similar to the suffix trees and tries that have a long history of application to string matching problems [52]. The paths from the root to the leaves in a mismatch tree represent not only the substrings in the data (as in suffix trees and tries), but also all neighbors of these substrings with up to $d$ mismatches. The data structure is a variation of the sparse prediction trees from Eskin et al., 2000 [38]. A mismatch tree is a rooted tree where each internal node has 4 branches, labeled with the elements of $\{A, C, G, T\}$. The maximum depth of the tree is $l$. Each node in the mismatch tree corresponds to the subspace of models with a fixed prefix that is defined by the path in the tree from the root to the node. For example, the node that is reached by the path ACG corresponds to the space of all models that have the prefix ACG. The root node of the mismatch tree corresponds to the space of all patterns. Each node corresponds to a subspace of models $\mathcal{P}$ and contains pointers to all $l$-mer instances from the sample that are within $d$ mismatches of a model $P \in \mathcal{P}$ (valid $l$-mers).

Consider a very simple example of processing the patterns of length 8 with up to 1 mismatch in the input sequence AGTATCAGTT. The substrings (8-mers) in the input sequence are AGTATCAG, GTATCAGT and TATCAGTT. Figure 1.1a shows the initial tree, while Figure 1.1b shows the tree after expanding the branch A. Notice that in most of the valid 8-mers in the leaves the number of mismatches increases. As we continue expanding the tree, the number of mismatches for many of the 8-mers will reach the maximum allowed, which causes many of the 8-mers not to be passed on to the leaf nodes. This is shown in Figure 1.1c, where the branch G is expanded from Figure 1.1b. The deeper we get in the tree, the fewer valid $l$-mers we keep.
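
The traversal can be sketched compactly (a simplified version of the procedure above; the tail pointers of Figure 1.1 are omitted and the helper names are ours):

def expand(label, instances, alphabet="ACGT", l=8, d=1):
    # `instances` holds (l-mer, mismatch count) pairs whose prefix is
    # within d mismatches of `label`. Expanding a branch appends one
    # symbol and drops instances whose mismatch budget is exhausted.
    if len(label) == l:
        return {label: [w for w, _ in instances]}
    patterns = {}
    for symbol in alphabet:
        surviving = [(w, m + (w[len(label)] != symbol))
                     for w, m in instances
                     if m + (w[len(label)] != symbol) <= d]
        if surviving:  # prune empty subtrees
            patterns.update(expand(label + symbol, surviving, alphabet, l, d))
    return patterns

sequence = "AGTATCAGTT"
lmers = [sequence[i:i + 8] for i in range(len(sequence) - 7)]
tree = expand("", [(w, 0) for w in lmers])
print(tree["AGTATCAG"])  # the 8-mers within 1 mismatch of this pattern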

1.3.2.2 SMT Data Structures

An example of a data structure that is used to store the feature space in a compact way is the SMT data structure, which is described in detail in Chapter 3.


Figure 1.1: A mismatch tree for the sequence AGTATCAGTT corresponding to a search for (8,1) motifs. The paths from the root to the nodes define the labels of the nodes. The leaf nodes contain (i) the number of mismatches between the prefix of a substring in the sequence and the label of the node and (ii) a pointer to the tail of the instance. (a) The tree in its initial state. (b) The tree after expanding the path A. (c) The tree after expanding the path AG. Notice that many of the instances reach the maximum number of allowed mismatches and are not passed down to their children nodes. For clarity, not all of the pointers are drawn in (c).

The SMT data structure defines a new way to store the feature space. In this model, the children of nodes in the template tree are either nodes or sequences. The key idea is that instead of always storing nodes in the tree which explicitly correspond to models in the feature space, we instead store the sequences and generate the models in the feature space on demand. Figure 1.2 gives examples of the data structure. A parameter of the data structure defines the maximum number of sequences that can be stored on the branch of a node.
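
The following is a minimal sketch of this lazy expansion (ours; the actual SMT structure also handles wild-cards and mixture weights, and the example sequences are made up):

class LazyNode:
    def __init__(self, n_max=3):
        self.n_max = n_max
        self.branches = {}  # first symbol -> LazyNode or list of pairs

    def add(self, sequence, output):
        head, tail = sequence[0], sequence[1:]
        child = self.branches.setdefault(head, [])
        if isinstance(child, LazyNode):
            child.add(tail, output)
            return
        # Store the raw (sequence, output) pair; expand the branch into
        # a real node only once more than n_max pairs accumulate.
        child.append((tail, output))
        if len(child) > self.n_max and all(t for t, _ in child):
            node = LazyNode(self.n_max)
            for t, out in child:
                node.add(t, out)
            self.branches[head] = node

root = LazyNode(n_max=3)
for seq, out in [("ACDACAC", "A"), ("ACADACC", "C"), ("ACAAACD", "D"),
                 ("CADACAC", "C"), ("ACDACAD", "A")]:
    root.add(seq, out)
# The 'A' branch now holds more than n_max pairs, so it was expanded.
print(type(root.branches["A"]).__name__)  # LazyNode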

1.3.3 Classification/Transduction of Sparse Sequences

The first problem that we address is the transduction problem. In the transduction problem we are given input sequences $x_1, \ldots, x_n$ and corresponding output sequences $y_1, \ldots, y_n$. The goal of the problem is to build a transducer which can map input sequences to output sequences. A special case of the transduction problem is the classification problem, where the output sequences are just a single symbol for each input sequence. In this case, the output symbol corresponds to the class of the sequence.

In this thesis, we show how to address this problem in our sparse sequence modeling framework. We present Large Margin Prediction Trees (LMPTs), which are a set of sparse sequence models trained using Support Vector Machines (SVMs) [27] and Boosting [25]. For LMPTs, we modify the classical SVM and Boosting algorithms to take advantage of the data structures presented in this thesis. The LMPT models subsume several previously presented sequence models, including the Spectrum Kernel models of [83] and [82]. We also present Sparse Markov Transducers (SMTs), which use wild-card models trained using a Bayesian mixture technique and a variant of the efficient data structures.

Critical for sequence models for classification is the use of context based priors, as demonstrated by the effectiveness of using Dirichlet priors to improve classification accuracy [118]. In this thesis we present Mixtures of Common Ancestors (MCAs), which provide an effective context based prior. The main advantage of MCAs is that they do not suffer from the problems related to local minima that affect Dirichlet priors, which makes them effective for large alphabets [39]. We also extend the classical Boosting algorithm to take advantage of context based priors.

The classification techniques are applied to the protein family classification problem.


Figure 1.2: Efficient data structures for SMTs. The boxes represent input sequences with their corresponding output. (a) The tree after five input sequences have been added. (b) The tree after a sixth input sequence is added. Note that a node has been expanded because of the addition of the input.

1.3.4 Outlier Detection for Sparse Sequences

The second problem that we address is the problem of outlier detection in sparse sequences. In this problem we are given a set of sequences, and the goal is to detect which sequences are outliers within this set.

We present two methods for performing outlier detection over sequences. Each method uses a different framework for defining outliers. The first method employs a geometric framework for unsupervised anomaly detection (UAD) [36]. This framework maps sequences to a vector space (defined by sparse sequence models) and then determines which points are outliers in this feature space. This method takes advantage of the efficient data structures to compute the pairwise distances between points in the feature space.
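As a small worked sketch of the geometric idea (with illustrative names; the kernels themselves are developed in Chapter 2), the Euclidean distance between the images of two sequences can be obtained from three kernel values alone, which is why computing pairwise kernel entries with the efficient data structures suffices:

    import math

    def feature_space_distance(k_xx, k_xy, k_yy):
        # ||Phi(x) - Phi(y)||^2 = Phi(x).Phi(x) - 2 Phi(x).Phi(y) + Phi(y).Phi(y);
        # the max() guards against tiny negative values from rounding.
        return math.sqrt(max(k_xx - 2.0 * k_xy + k_yy, 0.0))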

The second method uses a generative framework for detecting outliers [41]. We fit a sequence to a model trained over the data by using the model to predict each next symbol of the sequence. Using these predictions we can compute the overall likelihood of the sequence given the model trained over the data. Using this likelihood we can determine which sequences are the outliers.
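In outline, and assuming a hypothetical predict(prefix, symbol) that returns the trained model's conditional probability of the next symbol, the per-sequence score is a sum of log probabilities; the sequences with the lowest scores are flagged as outliers. This is a sketch of the scoring step, not the thesis code:

    import math

    def sequence_log_likelihood(seq, predict):
        # log P(seq) = sum_t log P(seq[t] | seq[0..t-1]); predict() must
        # return a strictly positive probability (e.g., via smoothing).
        return sum(math.log(predict(seq[:t], seq[t])) for t in range(len(seq)))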

We apply the outlier detection algorithms to modeling system calls. We detect anomalous system call sequences and use them to detect intrusions.

1.3.5 Signal Finding in Sparse Sequences

The third problem we address is the signal finding problem. In this problem we are given a set of sequences, and the goal is to find the highest scoring "model" over the sequences. The score can be defined either as the total number of occurrences or by a more sophisticated scoring measure. Our approaches to signal finding focus on detecting short sequences with mismatches, since these types of models are relevant to biological applications.

In this thesis, we present the Mismatch Tree Algorithm (MITRA) [40], an algorithm for signal finding


applied to DNA sequences. The core of MITRA is the efficient data structures and a method for pruning the search space of possible models. We present two methods for pruning the search space, each corresponding to a variant of the algorithm. MITRA-Count uses an algorithm similar to the one presented in [109], while MITRA-Graph uses a novel algorithm which incorporates ideas from [102].

We apply MITRA to detecting transcription factor binding sites in DNA sequences. Specifically, we focus on a class of composite binding sites called dyad signals, where the signal has two parts separated by a fixed length sequence. We apply MITRA to discover dyad signals in bacterial genomes [37]. Because of the efficient data structures, the MITRA algorithm is able to scale up to the size necessary for processing an entire bacterial genome. Critical to effective signal finding is the method for scoring the models. We present a new method for scoring signals which we apply to the bacterial genomes.

1.4 Contributions of this Thesis

The contributions of the thesis can be split into four parts. The first set of contributions relates to the general model for sparse sequences. The next three sets of contributions correspond to the three computational problems addressed in the thesis.

General Sparse Sequence Modeling
• General method for modeling sparse sequences
• Arbitrary models of sparseness (mismatch models, wild-card models, triggers, etc.)
• Efficient data structures to keep models in memory

Transduction/Classification
• SVM and Boosting to train sparse sequence models
• Modifications to Boosting and SVM algorithms to iterate over features
• Sparse Markov Transducers (SMTs)
• Boosting extension to incorporate context priors
• Mixture of Common Ancestors (MCA) context priors
• Protein family classification application

Outlier Detection Problem
• Geometric framework for outlier detection using kernels
• Prediction approach to outlier detection
• System call modeling using sparse sequence models

Signal Finding
• Mismatch Tree Approach (MITRA) for motif finding
• Graph theoretic approach to pruning the search space for mismatch models
• Scoring functions for genome-wide analysis in bacteria


1.5 Organization of Thesis

The thesis is organized as follows. There are three major parts of the thesis, each corresponding to a computational problem.

The first part discusses sparse sequence models in the context of the transduction/classification problem. The models are applied to the protein family classification problem. Chapter 2 presents Large Margin Prediction Trees, which are the centerpiece of this approach. This chapter includes a description of the framework, different models of sparseness, the SVM and Boosting algorithms in this context, and the efficient data structures for computing the models. This chapter also describes the context based extension to boosting. Chapter 3 presents Sparse Markov Transducers, which are an alternate approach for modeling sparse sequences with wild-cards and use a variant of the data structures. Chapter 4 describes the Mixture of Common Ancestors prior. The models and the application to protein family classification presented in Chapters 2, 3, and 4 are joint work with Yoram Singer, William Noble and Christina Leslie.

The second part discusses sparse sequence modeling applied to outlier detection. These approaches are applied to anomaly detection over system call traces in the context of intrusion detection. Chapter 5 describes the geometric framework for outlier detection and the application of sparse sequence models. Chapter 6 describes the generative framework for outlier detection. Chapters 5 and 6 are joint work with Salvatore Stolfo.

The third part discusses signal finding over sparse sequences. These methods are applied to discovering transcription factor binding sites in DNA sequences. Chapter 7 describes the MITRA algorithm and its application to discovering transcription factor binding sites. Chapter 8 describes the application of MITRA to the discovery of transcription factors over complete bacterial genomes. Chapter 7 is joint work with Pavel Pevzner, and Chapter 8 is joint work with Pavel Pevzner and Mikhail Gelfand.

Finally, we conclude with a discussion of future directions in Chapter 9. These directions include extensions to the algorithms and solving new computational problems within the sparse sequence model framework. We also discuss potential applications of the models to different areas of computer science.


    Part II

Classification of Sparse Sequences


Classification and Transduction of Sparse Sequences

The problem of classification of sparse sequences is a supervised learning problem. We are given a set of sequences $x_1, \ldots, x_n$, each with a corresponding label $y_1, \ldots, y_n$. The goal is to learn a classifier which can take a new sequence $x$ and predict its label $y$. Typically each sequence is obtained by passing a sliding window of a certain length over the input sequences. In this case, if the sequences $x_i$ are defined so that $x_i = x_{i-k} \ldots x_i$ and $y_i$ is a corresponding output for position $x_i$, then we can view this problem as a transduction problem. In this case, we are learning the mapping from input sequences $x$ to output sequences $y$.

In Chapter 2 we define Large Margin Prediction Trees (LMPTs), a classification model based on the sparse model framework which uses Boosting and Support Vector Machines for training over the sequences. Since classification is often a multi-class problem ($y_i$ is often a label from a set of possible labels) and boosting and SVMs are binary classifiers, we use the output coding framework to allow for multi-class predictions.

In Chapter 3 we define Sparse Markov Transducers (SMTs), a classification model based on wild-card models and a Bayesian mixture technique. There are some inherent incompatibilities between the LMPT and SMT models which do not allow them to be easily integrated. The reason is that they use different learning paradigms: LMPTs use the decision theoretic approach of margin classifiers, while SMTs use the Bayesian approach of mixture models. Because of these incompatibilities we present them separately.

An important factor for effective classification of a sequence is context-based priors, as demonstrated in [118]. In Chapter 4 we present Mixtures of Common Ancestors (MCAs), a type of context-based prior that has some advantages over previous methods. These priors are very easy to incorporate into SMTs. For LMPTs, we present an extension of the boosting algorithm in order to incorporate these priors. Using the extension and the output coding framework, we can incorporate MCA priors into the LMPT model. The context-based prior boosting extension is presented in Chapter 2.

All of the sequence classification techniques described in these chapters are applied to the problem of protein family classification.


    Chapter 2

Large Margin Prediction Trees

2.1 Motivation

Modeling discrete sequences is a fundamental problem which has broad applications in many domains. In many of these problems, the data can be characterized as sparse. For example, consider a set of amino acid sequences of length 8. Since there are 20 amino acids, there are a total of $20^8$ possible sequences of length 8. For even large data sets, only a tiny fraction of these sequences would be observed, and in practice rarely would the same sequence be observed twice. For example, two subsequences taken from the 3-hydroxyacyl-CoA dehydrogenase protein family can be clearly very similar and yet share at most two consecutive matching symbols; if we allow matching gaps or wild-cards, they match very closely. Similarly, in other problems we can allow different criteria for matching sequences, such as being within a certain number of mismatches.

The most popular approach to modeling sequences is using Hidden Markov Models (HMMs) [76]. These models and related models such as Maximum Entropy Markov Models [88] and Conditional Random Fields [66] use graphical models to represent conditional dependences between elements of the sequences.

In this chapter, we present Large Margin Prediction Trees (LMPTs), an alternative approach to modeling sequences that specifically takes into account the sparse properties of the data. LMPT models are related to probabilistic suffix trees (PSTs). A PST is a model that predicts the next symbol in a sequence based on the previous symbols (for a formal description see, for instance, [131, 107, 57]).

LMPTs generalize PST models in three ways. First, LMPTs provide a simple generalization from a prediction (generative) model to a transduction (discriminative) model which probabilistically maps sequences over an input alphabet to corresponding sequences over an output alphabet. This formalism allows the input alphabet to be different from the output alphabet. We make use of this generalization in our experiments. LMPTs use the output coding framework presented in [2] to perform multiclass predictions.

Second, LMPTs generalize the notion of conditioning on previous symbols in PSTs to a more flexible notion of conditioning on sparse contexts. LMPTs define a set of predicates, each corresponding to a specific context. An example of a predicate would be one corresponding to a wild-card context such as the one above. If a sequence contains the context, the predicate returns 1; otherwise, it returns 0. Instead of conditioning only on the previous symbols, we can condition on arbitrary sparse contexts. This allows for extending PSTs to handle sparse data, such as through the incorporation of wild-cards or gaps as in [38]. Furthermore, this allows for much more general models of sparse data, including mismatch models, trigger models, and spectrum models such as those presented in [83].


Finally, LMPTs generalize the way these models are trained, which allows us to use algorithms such as Boosting [25] and Support Vector Machines [27] to train the models. We present versions of these algorithms appropriate for our setting. We also define these algorithms over abstract predicates, which lets us apply them to different models of sparse data.

Several previous sequence modeling methods can be represented as specific instances of the general class of models corresponding to LMPTs. These models include the spectrum kernel and mismatch kernel SVM models [83, 82]. Several other previous sequence models, although not exactly represented, are closely related to instances of LMPTs. These models include PSTs and their various extensions [117, 7], and Sparse Markov Transducers [38]. LMPTs also provide new methods to model sequences.

A general problem with modeling sparse sequences is the exponential number of possible contexts that need to be considered. For example, if we are looking at a PST model, we consider various suffixes of the input sequence: the possible contexts are the null context, the context consisting of the last symbol, the context consisting of the last two symbols, etc. In general, for a PST model, given a sequence of length $l$, there are $l+1$ possible contexts. However, when modeling sparse sequences, such as when we allow wild-cards or mismatches in the contexts, the number of contexts grows exponentially. Storing all of these contexts in memory is often impractical even for small data sets.

To address this problem, we use efficient data structures which index the data to allow us to iterate through the contexts efficiently. The key idea is that we traverse the space of all contexts and process each context's relevant data during this traversal. This means we never have to keep state for any of the contexts, which gives LMPTs minimal memory requirements. We modify the standard boosting and SVM algorithms to take advantage of these data structures. Although LMPTs require minimal memory, we present an extension which can take advantage of available memory to speed up the training and prediction of LMPTs.

    2.2 Preliminaries

We are given a set of $m$ input strings $x_1, \ldots, x_m$ with corresponding outputs $y_1, \ldots, y_m$.


The confidence of a prediction is $|f(x_i)|$. In our model, we will learn a set of weights, one for each predicate. We denote the weight of predicate $\phi$ by $w_\phi$. Using these weights, the prediction of the model is

$$f(x) = \sum_{\phi} w_\phi\, \phi(x) + b \qquad (2.1)$$

where $b$ is the offset of the hyperplane from the origin. Note that for any input string the vast majority of the terms in the sum are $0$, because $\phi(x) = 0$ for most $\phi$. The predicate weights will be obtained using either the SVM algorithm or boosting. In the boosting algorithm, we compute a new set of weights during each round of boosting. We use the notation $w_\phi^t$ to denote the weight of the predicate $\phi$ after boosting round $t$. We also use $f^t(x)$ to denote the prediction of the model using the weights after the $t$-th round of boosting.

2.4 Learning with Boosting

We use the formulation of Boosting presented in [25], which casts the boosting algorithm in terms of Bregman divergences. We first present the algorithm as originally given, using slightly different notation appropriate for our setting, and then present a modification to take advantage of the efficient data structures.

    2.4.1 Original Algorithm

The goal of a boosting algorithm is to choose a set of weights $w_{\phi_1}, \ldots, w_{\phi_N}$ that minimizes a loss function over the training data. In our experiments, we use two loss functions, the exponential loss (ExpLoss) and the log loss (LogLoss). The exponential loss is defined as

$$\mathrm{ExpLoss}(w) = \sum_{i=1}^{m} e^{-y_i f(x_i)} \qquad (2.2)$$

The log loss is defined as

$$\mathrm{LogLoss}(w) = \sum_{i=1}^{m} \ln\left(1 + e^{-y_i f(x_i)}\right) \qquad (2.3)$$

Note that the quantity $y_i f(x_i)$ is positive if the prediction of the model is correct on example $i$ and negative otherwise. We present the algorithm in terms of LogLoss, but it can be easily adapted for ExpLoss.

To minimize the loss over the training set, in each round of boosting we update the weights using a parallel update, as described in [25].

We first initialize $w_\phi^0 = 0$ for all $\phi$. For each round $t$, we first compute the loss for each example $i$ at time $t$, $\rho_i^t$. For LogLoss,

$$\rho_i^t = \ln\left(1 + e^{-y_i f^t(x_i)}\right) \qquad (2.4)$$

For each predicate $\phi$, we compute the sum of these losses over the examples where the predicate is nonzero, $\phi(x_i) \neq 0$. We compute this sum separately for examples where the label is positive ($Z_\phi^{t+}$) and negative ($Z_\phi^{t-}$):²

$$Z_\phi^{t+} = \sum_{i:\, \mathrm{sign}(y_i \phi(x_i)) = +1} \rho_i^t\, |y_i \phi(x_i)| \qquad (2.5)$$

$$Z_\phi^{t-} = \sum_{i:\, \mathrm{sign}(y_i \phi(x_i)) = -1} \rho_i^t\, |y_i \phi(x_i)| \qquad (2.6)$$

Using these two quantities, we can compute the weight update amount $\delta_\phi^t$:

$$\delta_\phi^t = \frac{1}{2} \ln\left(\frac{Z_\phi^{t+}}{Z_\phi^{t-}}\right) \qquad (2.7)$$

Finally, the weights for the next round of boosting are obtained by

$$w_\phi^{t+1} = w_\phi^t + \delta_\phi^t \qquad (2.8)$$
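For concreteness, the sketch below implements one round of this parallel update for binary-valued predicates under LogLoss (equations 2.4 through 2.8). It is a deliberately naive rendering, with explicit loops over all predicates, of exactly the costs that the efficient extension below removes; the guard against empty sums is a simplification of the smoothing used in practice.

    import math

    def boosting_round(phis, X, y, w, f):
        # phis: predicate functions phi(x) in {0, 1}; X, y: examples and
        # +/-1 labels; w: weight per predicate; f: predictions f^t(x_i).
        m = len(X)
        rho = [math.log(1.0 + math.exp(-y[i] * f[i])) for i in range(m)]     # eq. (2.4)
        for p, phi in enumerate(phis):
            z_pos = sum(rho[i] for i in range(m) if phi(X[i]) and y[i] > 0)  # eq. (2.5)
            z_neg = sum(rho[i] for i in range(m) if phi(X[i]) and y[i] < 0)  # eq. (2.6)
            if z_pos > 0.0 and z_neg > 0.0:
                w[p] += 0.5 * math.log(z_pos / z_neg)                        # eqs. (2.7)-(2.8)
        # Recompute the predictions for the next round.
        f = [sum(w[p] for p, phi in enumerate(phis) if phi(X[i])) for i in range(m)]
        return w, f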

2.4.2 Efficient Extension

There are several reasons why the original boosting algorithm is not practical for modeling sparse sequences. The main problem is that the number of contexts $N$ is very large, typically exponential in the size of the alphabet. Thus computing $f(x_i)$ is problematic.

Although there is a very large set of contexts, only a small fraction of the predicates are nonzero for a given example. Thus if we could efficiently recover which predicates are nonzero for a given sequence, this would significantly speed up the algorithm.

Another problem is that storing the weight vectors ($w^t$) is often impractical because of memory considerations. We instead propose a variant of the boosting algorithm which does not require storing the weight vectors, but instead stores all $\rho_i^s$ from $s = 0$ up to the current round of boosting. Using these $\rho_i^s$ we can regenerate the weights as necessary. The basic idea is that we use a data structure that allows us to iterate through all of the contexts and efficiently determine which examples have a nonzero predicate for each context. For each context, we generate $w_\phi^t$ using the history of $\rho_i^s$ and then update the prediction for all examples that have a nonzero predicate for that context. After we finish the iteration, we use the predictions to compute $\rho_i^{t+1}$.

More formally, at each round of boosting we initialize a vector of predictions $f_i^{t+1} = 0$ for all $i$. We then iterate through the predicates $\phi$. For each predicate, we determine the set of examples which have a nonzero value for the predicate using the data structure. Using equations (2.5), (2.6), and (2.7) we compute $Z_\phi^{s+}$, $Z_\phi^{s-}$ and $\delta_\phi^s$ for $s = 0, \ldots, t$ using the stored values $\rho_i^s$. We then compute $w_\phi^{t+1}$ by

$$w_\phi^{t+1} = w_\phi^0 + \sum_{s=0}^{t} \delta_\phi^s \qquad (2.9)$$

We increment $f_i^{t+1}$ by $w_\phi^{t+1} \phi(x_i)$. After we iterate through all of the predicates, $f_i^{t+1} = f^{t+1}(x_i)$. Using $f^{t+1}(x_i)$ we can compute $\rho_i^{t+1}$ using equation (2.4).

This version of the algorithm only requires storing the $\rho_i^s$ for $s = 0, \ldots, t$ and not any values of $w^t$. Since we must iterate through the values of $\rho_i^s$ for previous rounds, the complexity of the update algorithm

² If we allow the predicates to have negative values, there is a slightly different interpretation of $Z_\phi^{t+}$ and $Z_\phi^{t-}$.


with respect to the number of rounds of boosting $T$ is $O(T^2)$, as opposed to $O(T)$ for the original algorithm. Since the parallel update algorithm converges very quickly, this is not a computational bottleneck, while the memory requirements of the traditional algorithm are.³

The key to the efficient algorithm is the ability to iterate through the predicates and efficiently recover all elements in the data which have nonzero values for those predicates. In Section 2.6 we present data structures for performing this iteration. Since the presented algorithm does not depend on the specific type of predicate in the set of possible predicates, we can incorporate arbitrary models of sparse data by modifying these data structures.
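A compact sketch of this variant appears below. The only state carried between rounds is the history of per-example losses; a hypothetical generator iterate_predicates(), standing in for the data structures of Section 2.6, yields each predicate's set of matching examples, and binary predicates are assumed.

    import math

    def efficient_round(iterate_predicates, rho_history, y):
        # rho_history[s][i] stores rho_i^s for every earlier round s.
        n = len(y)
        f_next = [0.0] * n
        for matching in iterate_predicates():     # indices i with phi(x_i) != 0
            w = 0.0                                # regenerate w_phi (eq. 2.9)
            for rho in rho_history:                # one delta per past round
                z_pos = sum(rho[i] for i in matching if y[i] > 0)
                z_neg = sum(rho[i] for i in matching if y[i] < 0)
                if z_pos > 0.0 and z_neg > 0.0:
                    w += 0.5 * math.log(z_pos / z_neg)
            for i in matching:                     # accumulate w * phi(x_i)
                f_next[i] += w
        rho_history.append(                        # eq. (2.4) for round t+1
            [math.log(1.0 + math.exp(-y[i] * f_next[i])) for i in range(n)])
        return f_next

Note that no weight vector is ever stored: each $w_\phi$ exists only transiently while its predicate is being visited, which is what keeps the memory footprint minimal.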

    2.4.3 Context Priors for Boosting

Many sequence models, such as PSTs [117], SMTs [38], and HMMs [76], take advantage of prior knowledge about the problem encoded in terms of context based predictions. These priors include Dirichlet mixtures [118] and MCAs [39]. We incorporate context priors into the boosting algorithm in a similar way to how example based priors were incorporated into boosting in [106]. We present our extension in terms of the general algorithm, but it can easily be applied to the memory efficient version as well.

Priors are encoded into the boosting algorithm by pairs of "virtual" examples, one pair for each context. We use the notation $v_\phi^+$ and $v_\phi^-$ to denote the virtual examples corresponding to the context $\phi$; these examples are labeled $+1$ and $-1$ respectively. For the example $v_\phi^+$ we define $\phi'(v_\phi^+) = 0$ if $\phi' \neq \phi$ and $\phi'(v_\phi^+) = \eta_\phi^+$ if $\phi' = \phi$, where $\eta_\phi^+$ corresponds to the weight of the prior in the positive direction. Similarly, $v_\phi^-$ is defined with $\eta_\phi^-$, which corresponds to the weight of the prior in the negative direction.

The predictions of the virtual examples $v_\phi^+$ and $v_\phi^-$ after boosting round $t$ are simply $w_\phi^t \eta_\phi^+$ and $w_\phi^t \eta_\phi^-$. The losses $\rho(v_\phi^+)$ and $\rho(v_\phi^-)$ of the virtual examples can be computed using equation (2.4), which gives (for LogLoss)

$$\rho(v_\phi^+) = \ln\left(1 + e^{-w_\phi^t \eta_\phi^+}\right) \qquad (2.10)$$

$$\rho(v_\phi^-) = \ln\left(1 + e^{w_\phi^t \eta_\phi^-}\right) \qquad (2.11)$$

This gives us new versions of equations (2.5) and (2.6):

$$Z_\phi^{t+} = \sum_{i:\, \mathrm{sign}(y_i \phi(x_i)) = +1} \rho_i^t\, |y_i \phi(x_i)| + \rho(v_\phi^+) \qquad (2.12)$$

$$Z_\phi^{t-} = \sum_{i:\, \mathrm{sign}(y_i \phi(x_i)) = -1} \rho_i^t\, |y_i \phi(x_i)| + \rho(v_\phi^-) \qquad (2.13)$$

Note that since one of the virtual examples will have the opposite sign from $w_\phi^t$, one of the losses of the virtual examples will be non-negligible. This serves two purposes. The first is that it constrains the boosting algorithm from over-fitting. The second is that it gives us a lot of flexibility in setting context priors. If we set $\eta_\phi^+$ and $\eta_\phi^-$ to different values, the prior will have a bias toward a prediction. The overall magnitudes of $\eta_\phi^+$ and $\eta_\phi^-$ determine the relative weight of the priors versus the data. When combined with output codes, we can use these priors to incorporate information about counts of different symbols in a given context using Dirichlet mixtures or MCAs.

³ Even for suffix contexts over amino acids, the memory requirements are exponential in the length of the subsequences. With 20 amino acids, there are 3,200,000 contexts representing subsequences of length 5.


    2.5 Learning with SVMs

We can view the space of predicates as an $N$-dimensional vector space, which we refer to as the feature space. We use the notation $\Phi(x_i)$ to denote the image of each input string $x_i$ in this vector space.

We can view the function $f(x)$ in equation (2.1) as defining a hyperplane in this space, since it is a linear combination of weights $w_\phi$, each corresponding to a dimension in the feature space. In this view, a "good" hyperplane would separate the positive examples from the negative examples.

We can use the Support Vector Machine (SVM) algorithm to obtain such a hyperplane. The SVM algorithm minimizes the following objective function

$$\frac{\|w\|^2}{2} + C \sum_i \xi_i \qquad (2.14)$$

with constraints

$$\Phi(x_i) \cdot w + b \geq 1 - \xi_i \quad \text{for } y_i = +1 \qquad (2.15)$$

$$\Phi(x_i) \cdot w + b \leq -1 + \xi_i \quad \text{for } y_i = -1 \qquad (2.16)$$

$$\xi_i \geq 0 \qquad (2.17)$$

The two terms in the optimization represent a tradeoff between a hyperplane that will generalize well and minimizing the errors over the training set; $C$ is a scaling factor for this tradeoff. The first term represents the flatness of the hyperplane, while the second term represents the soft margin, where the $\xi_i$ are nonzero for any training examples that are on the wrong side of the hyperplane.

One advantageous property of SVMs is that we are able to represent the optimization in terms of Lagrange multipliers. This form is referred to as the dual form. In this form, the optimization problem is to maximize

$$\sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, \Phi(x_i) \cdot \Phi(x_j) \qquad (2.18)$$

with constraints

$$0 \leq \alpha_i \leq C \qquad (2.19)$$

$$\sum_i \alpha_i y_i = 0 \qquad (2.20)$$

where $\alpha_i$ is a Lagrange multiplier. Each input training point where $\alpha_i > 0$ is referred to as a support vector. We can represent the hyperplane in terms of the support vectors:

$$w = \sum_i \alpha_i y_i\, \Phi(x_i) \qquad (2.21)$$

Thus the dot product with the normal vector of the hyperplane can be represented as the dot product with the support vectors, which gives us a solution for $f(x)$:⁴

$$f(x) = \sum_i \alpha_i y_i\, \Phi(x_i) \cdot \Phi(x) \qquad (2.22)$$

Note that we do not need to compute the dot products with any vectors that are not support vectors, since for these $\alpha_i = 0$ and thus they do not contribute to the sum.

We can then decompose this hyperplane into a feature vector of the form (2.1), since we can explicitly compute the vectors $\Phi(x_i)$ for the support vectors.

⁴ This drops the bias term $b$, which we can incorporate but omit for clarity.
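A short sketch of this decomposition, assuming the support vectors, their labels, and their multipliers are available and the predicates are enumerable (in practice the traversal data structures of Section 2.6 would replace the inner loop):

    def explicit_weights(alphas, ys, support_vectors, phis):
        # Recover the predicate weights w_phi of equation (2.1) from the
        # support-vector expansion of equation (2.21).
        w = {}
        for a, label, x in zip(alphas, ys, support_vectors):
            for p, phi in enumerate(phis):
                v = phi(x)
                if v:
                    w[p] = w.get(p, 0.0) + a * label * v
        return w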


2.5.1 SVMs over Sparse Sequences

In order to solve the optimization problem in the SVM, all we need to compute is a kernel matrix. This matrix stores the pairwise values of dot products between the input elements.

We can apply the SVM algorithm to sparse sequence modeling as follows. Our feature space for the SVM is the vector of predicate values for each sequence.

We can define a kernel function directly over the sequences that computes the dot products between the images of sequences in this feature map. Note that a predicate feature $\phi$ contributes to the dot product between sequences $x$ and $y$ only if both $\phi(x) \neq 0$ and $\phi(y) \neq 0$.

Computing the kernel matrix requires a traversal of the set of predicates. For each predicate $\phi$ during this traversal, for every pair of input strings that satisfy the predicate, we update the corresponding value of the kernel matrix. After traversing all of the predicates, we are left with a kernel matrix for the data. The same data structure used for boosting can efficiently compute the kernel matrix.
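In sketch form, with a hypothetical iterate_predicates() that yields the indices of the input strings satisfying each predicate (the interface the Section 2.6 data structures provide) and binary predicates assumed:

    def kernel_matrix(iterate_predicates, n):
        # One pass over the predicates fills the entire n x n kernel matrix.
        K = [[0.0] * n for _ in range(n)]
        for matching in iterate_predicates():
            for i in matching:
                for j in matching:
                    K[i][j] += 1.0   # phi(x_i) * phi(x_j) = 1 for this predicate
        return K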

2.6 Efficient Data Structures

The data structures allow for an efficient way to iterate through all of the predicates and to quickly retrieve all of the input strings that satisfy a predicate. There is a different data structure for each type of predicate. Since all of the data structures have many similarities, we describe the data structure used for mismatch models in detail and then later discuss the differences between mismatch model data structures and data structures for other types of sparse models.

The data structure used for mismatch models is called a mismatch tree. The basic idea behind our approach is to traverse the feature space (the set of all $k$-mers) in lexical order and efficiently keep track of which observed $k$-length subsequences in the sample data set match the current target $k$-mer within the allowable number of mismatches. In fact, we only need to traverse the set of all $k$-mers that occur in the data (with mismatches). The algorithm is efficient because only one traversal of the set of $k$-mers is necessary to compute the entire kernel matrix. The algorithm is similar to the algorithm used for discovering patterns that occur with mismatches presented in [108, 97].

2.6.1 Mismatch Tree Data Structure

Mismatch trees are similar to suffix trees, which have a long history of applications to string matching problems [52]. The paths from the root to the leaves in a mismatch tree represent not only the substrings in the data (as in suffix trees), but also all neighbors of these substrings with up to $m$ mismatches. The data structure is a variation of the sparse prediction trees [38], which are built on an idea presented in [100].

A $(k, m)$-mismatch tree is a rooted tree where each internal node has 20 (more generally, $|\Sigma|$) branches, each labeled with an amino acid (a symbol from $\Sigma$). The maximum depth of the tree is $k$. Each node in the mismatch tree corresponds to the subspace of all possible $k$-mers with a fixed prefix, obtained by concatenating the branch symbols along the path in the tree from the root to the node. The root node of the mismatch tree corresponds to the space of all possible $k$-mers, $\Sigma^k$. In addition, at each node at depth $d$, we maintain pointers to all substrings from the sample data set whose $d$-length prefixes are within $m$ mismatches of the $d$-length prefix represented by the path down from the root; this set of substrings represents the valid instances of the $d$-length prefix in the data.

Note that the set of valid substrings for a node is a subset of the set of valid substrings for the parent of the node. We efficiently generate the set of valid substrings for a node by keeping track of the number of mismatches between each valid substring and the prefix of the node. For a valid substring in the parent of a


node, there are two cases: either the position corresponding to the branch to the child matches the substring, or it does not. In the first case, the substring is still valid for the child. In the second case, the count of mismatches for that substring increases by one. If the mismatch count exceeds the threshold $m$, the substring is not passed on to the child.

A node of depth $k$ corresponds to a $k$-mer in the feature space and stores pointers to all of its instances (with mismatches). At these nodes, we can process the data in order to either compute the next iteration of weights for the boosting algorithm or compute the kernel matrix for the SVM algorithm.

Our algorithm is as follows. We first examine the root node, which corresponds to the set of all $k$-mers that occur in the data. This node contains a pointer to all substrings in the sample. We then examine the first child, A. This child points to all of the substrings in the sample that have prefix A (with 0 mismatches) and to all of the substrings in the sample that have a different prefix (with 1 mismatch). We continue with a depth first search until either we reach a leaf node, in which case we update the kernel matrix as explained above, or we reach a node that contains no instances, in which case we no longer have to examine below the current node. After reaching a leaf node or a node that contains no instances, we backtrack and continue the depth first search.
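As a concrete companion to this traversal, here is a minimal self-contained Python sketch; the function name, the instance representation, and the pair-counting leaf update are illustrative choices rather than the thesis code.

    def mismatch_kernel(sequences, k, m, alphabet="ACDEFGHIKLMNPQRSTVWY"):
        # Depth-first (k, m)-mismatch tree traversal. An instance is a
        # (sequence_index, start, mismatches_so_far) triple; only the nodes
        # on the current root-to-leaf path are ever expanded.
        n = len(sequences)
        K = [[0.0] * n for _ in range(n)]
        # Root node: every k-mer start position in every sequence.
        root = [(i, s, 0) for i, seq in enumerate(sequences)
                for s in range(len(seq) - k + 1)]

        def dfs(depth, instances):
            if not instances:
                return                    # no valid instances: prune this branch
            if depth == k:
                # Leaf: one k-mer of the feature space; every pair of valid
                # instances contributes to the kernel matrix.
                for i, _, _ in instances:
                    for j, _, _ in instances:
                        K[i][j] += 1.0
                return
            for symbol in alphabet:
                child = []
                for i, s, mis in instances:
                    mis += sequences[i][s + depth] != symbol
                    if mis <= m:          # drop instances over the threshold
                        child.append((i, s, mis))
                dfs(depth + 1, child)

        dfs(0, root)
        return K

For instance, mismatch_kernel(["AVLALKAVLL"], 8, 1) corresponds to the small setting worked through in Section 2.6.3 below.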

2.6.2 Advantages of the Mismatch Tree Approach

The traversal of the mismatch tree is efficient because we need only search down paths corresponding to $k$-mers that occur (with mismatches) in the data. Note that even when traversing the $k$-mers with no mismatches, it is much more efficient to use a mismatch tree traversal than to represent the $k$-mers in each sequence explicitly. In the 0-mismatch case, the traversal time of the tree is linear in the length of the input data.

Another advantage of the mismatch algorithm, leading to fast compute times in practice, is the efficient use of memory. Since the search is equivalent to traversing the feature space in lexical order, when we backtrack in the tree we collapse the current node and expand the next node. Since the only expanded nodes are along the current search path, there is a maximum of $k$ stored nodes in the tree (counting the root node). This bounds the memory usage of the algorithm and makes the computation feasible.

There are values of $(k, m)$ for which the mismatch computation will be slow. The number of substrings that match a given observed substring in the data within $m$ mismatches grows exponentially with $m$, and when $m$ is large compared to $k$, almost all the leaf nodes will need to be visited in the traversal. However, kernels in which the number of mismatches $m$ is large compared to the length $k$ are unlikely to be useful for applications. In practice, the traversal of the feature space is quite feasible.

2.6.3 Example of Mismatch Tree Data Structure

Consider a very simple example of computing the kernel matrix for $k$-mers of length 8 with up to 1 mismatch in the input sequence AVLALKAVLL. The substrings (8-mers) in the input sequence are AVLALKAV, VLALKAVL and LALKAVLL. Clearly, in a practical problem, there would be thousands of substrings. Figure 2.1a shows the initial tree, while Figure 2.1b shows the tree after expanding the branch A. As we continue expanding the tree, we note that for many of the 8-mers the number of mismatches will exceed the maximum allowed, so that these instances are not passed on to the leaf nodes. Thus the actual time to update the kernel matrix by processing a leaf at depth 8 is small, since there are usually few instances.