Ibn Sina


Upload: yasmine-gaber

Post on 21-Jan-2015


Category:

Technology



DESCRIPTION

Presentation describing the Ibn Sina graduation project, given in July 2010.

TRANSCRIPT

  • 1. Please turn off your mobiles or put them on silent mode

2. Biological Relation Extraction Tools Using Biomedical Ontologies and Text Mining
3. Agenda: Introduction to Biomedical Text Mining; System Overview (Problem Description, Motivation, Challenges); System Framework; Applications upon the System Framework (Swanson's Algorithm, Protein-to-Protein Interactions (PPI), Gene Clustering Based on Text Mining); Extended Work; Conclusion and Future Work.
5. Introduction to Biomedical Text Mining. Text mining processes unstructured (textual) information, extracts meaningful data, and makes the information contained in the text accessible to data mining (statistical and machine learning) algorithms. Biomedical text mining applies this to biomedical documents.
7. System Overview, Problem Description. A huge amount of information is stored in millions of documents. This information can be used effectively to solve many problems: retrieving knowledge with little effort; discovering relationships between different entities; assessing the strength of relationships between different entities; grouping entities into different clusters.
8. System Overview, Motivation. Build a semantic structure of documents that facilitates navigation through thousands of documents. Extract relationships between biomedical terms using text mining techniques with the aid of biomedical ontologies. Use text mining to group genes into different clusters.
9.
System Overview, Challenges. Concept recognition; building a semantic structure of annotated documents using ontologies; relationship recognition; similarity (distance) between different entities.
10. Overall System Components: Framework; Searching and Browsing; Swanson's Algorithm; PPI; Gene Clustering.
11. Overall System Architecture (diagram): Searching and Browsing; Swanson's Algorithm; PPI; Gene Clustering; Framework.
13. System Framework Agenda: Objective; Framework Concept Issues; Framework Design Issues; Framework Sequence Diagram; Framework Database; Framework GUI; Framework Demo.
14. System Framework Agenda: Objective; Framework Concept Issues; Framework Design Issues; Framework Sequence Diagram; Framework Database; Framework GUI; Comparison; Framework Demo.
15. System Framework, Objective. Use ontologies to mark up biomedical text documents. Based on the established semantic links between documents and ontology concepts, the goal is to build a semantic representation of the information. Provide services to other applications and users.
16. System Framework: Framework Concept Issues; Framework Design Issues.
18. Framework Concept Issues (flow diagram; box labels): user query; query expansion; expanded query; search PubMed; fetching documents; PubMed documents; Gene Ontology; extract GO terms; annotate PubMed documents; annotated documents; structured representation of documents.
19. System Framework, PubMed. The largest document source in the biomedical field. Contains over 18 million documents. Maintained by the United States National Library of Medicine (NLM). Indexes all documents by MeSH terms to facilitate searching and retrieval.
20.
System Framework, Gene Ontology. The Gene Ontology project is a major bioinformatics initiative that aims to standardize the representation of gene and gene product attributes across species and databases. It includes a controlled vocabulary of terms for describing gene product characteristics and consists of three main categories: cellular component, biological process, and molecular function.
21. System Framework, MeSH database. A comprehensive controlled vocabulary for the purpose of indexing journal articles and books in the life sciences; it can also serve as a thesaurus that facilitates searching [Wikipedia]. MeSH main headings: Anatomy; Organisms; Diseases; Chemicals and Drugs; Analytical, Diagnostic and Therapeutic Techniques and Equipment; Psychiatry and Psychology; Phenomena and Processes; Disciplines and Occupations; Anthropology, Education, Sociology and Social Phenomena; Technology, Industry, Agriculture; Humanities; Information Science; Named Groups; Health Care; Publication Characteristics; Geographical Locations.
22. System Framework, Query Expansion (QE). The process of reformulating a seed query to improve retrieval performance in information retrieval operations [Wikipedia]. How? Example:
23. Query Expansion Example (diagram): ocellus pigmentation; pigmentation; pigment accumulation; pigment metabolic process; cellular pigmentation.
24. System Framework, Document Annotation. Annotate documents with Gene Ontology terms, genes, and proteins. Represent each document by a set of terms. (How?)
25. GO extractor. GO's vocabulary consists of 7,841 words. The majority of the GO words occur only once in the whole ontology, and more than 90% occur no more than 10 times. On the other hand, 51 of the GO words occur at least 100 times in the ontology. Words with a very high frequency do not give much information, as they are part of many labels in the ontology. However, extracting a word with a low frequency gives a much better hint about a mentioned concept.
(Zipf's law). By the nature of GO terms, the words at the end are very general (e.g. "activity", "transport"). Besides, many GO terms are substrings of their descendant GO terms. The algorithm is taken from GoPubMed (2008), "GoPubMed: Ontology-based literature search for the life sciences".
26. GO extractor algorithm (flowchart; box labels): get the last word; set the main root as a root; compare with the root; do BFS and take each match as a root; the same word occurred at any sibling (Y/N); reaches a leaf (Y/N); get the next word; get the next word, do BFS, and consider each match as a root.
27. GO Extractor Example. Abstract: "... and it is affected by the kinase activity." Start from the last word of the paragraph, "activity". Starting from the root of the GO tree, search for a GO term ending with "activity". When we reach it, fetch the next word and start from the new root. Now we are looking in the subtree for a term ending with "kinase activity". If the search reaches a leaf, we have found a GO term. Then restart from the root with the next word.
29. Framework Design Issues. The top-level architecture of the system can be divided into: Data Handling Components; Information Handling Components; Information Extraction; Information Representation; Information Retrieval.
30. Class Diagram
31. System Framework, main components: Document Sources; Extractor; Document Annotators; Ontology Manager; System Engine; Database Manager; Cache Manager; Document.
32. System Framework. Document Sources: fetch single documents or collections of documents from remote stores. Extractor: implements information extraction algorithms to extract ontology terms from the documents. Document Annotators: establish semantic links between documents and ontology concepts, for example linking documents with their GO terms, MeSH terms, etc.
33.
System Framework. Ontology Manager: provides an interface around the ontologies; composed of sub-managers to merge ontologies such as the Gene Ontology. System Engine: the main component of the system; responsible for maintaining all operations and communication between the various components of the system.
34. System Framework. Database Manager: implemented as a pool object (connection pool); handles and maintains queries to the database, such as inserting, updating, and deleting documents. Cache Manager: implemented as a client of Memcached (an open-source caching project); handles operations on the system cache.
36. Framework Sequence Diagram
38. Framework Database
40. Framework GUI, goals: user friendliness; consistency; Model-View-Controller (MVC); human-computer interaction concepts; usability; satisfaction of application-specific services; standard data exchange; internationalization.
41. Framework GUI
43. Comparison of our system with Textpresso, XplorMed, and Vivisimo
Ontology used. Our system: the full Gene Ontology. Textpresso: only 30 categories. XplorMed: the top hierarchy of the MeSH ontology. Vivisimo: derives an ontology from the search result.
Output. Our system: uses the deep ontology to navigate through a large result set in a non-sequential order. Textpresso: returns a list of relevant abstracts. XplorMed: for each MeSH category, there is an associated list. Vivisimo: returns a list of relevant abstracts.
44. IBN-SINA vs.
Others
Works on. IBN-SINA: all the PubMed abstracts. Textpresso: designed for full papers, which are not available most of the time. XplorMed: all the PubMed abstracts. Vivisimo: all the PubMed abstracts.
Term extraction. IBN-SINA: allows gaps within term matches and considers the information content of the words, which leads to more refined term extraction. Textpresso: tries to find the category terms directly in the text, allowing only for some variations in lower/uppercase letters and plural forms. XplorMed: extracts terms based on term frequency in the collected documents. Vivisimo: extracts terms based on term frequency in the collected documents.
45. Our System vs. GoPubMed
47. Framework Demo
50. Swanson's Algorithm (1986). Swanson's method is a way of finding indirect relations between objects. (Diagram: A with related terms A1 and A2; B with related terms B1 and B2.) Reference: Swanson, 1986, "Undiscovered public knowledge".
51. Cosine Similarity. Cosine similarity is a measure of similarity between two vectors of n dimensions, obtained by finding the cosine of the angle between them; it is often used to compare documents in text mining [Wikipedia].
Terms related to the first term (A's related terms): A B C D E F G H
Terms related to the second term (B's related terms): A X Y B Z D E F
Term             A B C D E F G H X Y Z
A's vector       1 1 1 1 1 1 1 1 0 0 0
B's vector       1 1 0 1 1 1 0 0 1 1 1
52. Cosine Similarity (Cont.)
Finally, applying the cosine similarity function:
Term             A B C D E F G H X Y Z
A's vector       1 1 1 1 1 1 1 1 0 0 0
B's vector       1 1 0 1 1 1 0 0 1 1 1
Similarity = (1+1+0+1+1+1+0+0+0+0+0) / (√8 × √8) = 5/8 = 0.625
53. Swanson example: relation between P53 and P51. Reference: Swanson, 1986, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge".
56. PPI Agenda: Problem Description; Motivation; PPI System Overview; PPI System Main Components (Dependency Parse Tree, Similarity Metrics, K-Nearest Neighbor Classifier); Evaluation of PPI (Evaluation Metrics, Results and Comparison).
58. Problem Description. Due to the ever-growing number of publications about protein-protein interactions, information extraction from text is increasingly recognized as one of the crucial technologies in bioinformatics. Reference: Gunes Erkan, Arzucan Ozgur, Dragomir R. Radev. Semi-Supervised Classification for Extracting Protein Interaction Sentences using Dependency Parsing. Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 228-237, Prague, June 2007.
60.
Motivation. The interactions between proteins are important for many, if not all, biological functions. The function of a protein can be characterized more precisely through knowledge of PPI. Information about these interactions improves our understanding of diseases and can provide the basis for new therapeutic approaches. It also helps validate experimental results and test benches.
62. System Overview. We worked at the sentence level. Why? It increases the semantic understanding gained from the sentence. The structure of the sentence adds to the knowledge obtained from it. A specific relation between proteins can be deduced from it.
63. System Overview
64. System Overview. Our approach depends on the following observation: the shortest path between the entities in the dependency tree of a sentence usually captures the information necessary to identify their relationship.
66. Dependency Parse Tree
67. Dependency Parse Tree. Unlike a syntactic parse, it captures the semantic predicate-argument relationships among the words of a sentence. We use the Stanford Parser API for the natural language processing task. The shortest path is found using breadth-first search (BFS): since every edge has equal weight, the nearest path is discovered first.
68. Dependency Parse Tree (Example). The dependency tree of the sentence "The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA."
69. Example (Cont.)
Then, we select the shortest paths between the protein pairs:
KaiC - nsubj - interacts - prep_with - SasA
KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiA
KaiC - nsubj - interacts - prep_with - SasA - conj_and - KaiB
SasA - conj_and - KaiA
SasA - conj_and - KaiB
KaiA - conj_and - SasA - conj_and - KaiB
70. Example (Cont.) Then, we rename the proteins in the pair as PROTX1 and PROTX2, and all the other proteins in the sentence as PROTX0:
PROTX1 - nsubj - interacts - prep_with - PROTX2
PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
PROTX1 - nsubj - interacts - prep_with - PROTX0 - conj_and - PROTX2
PROTX1 - conj_and - PROTX2
PROTX1 - conj_and - PROTX2
PROTX1 - conj_and - PROTX0 - conj_and - PROTX2
72. Similarity Metrics
73. Similarity Metrics. The main idea of using similarity metrics is to find a function that maps input patterns into a target space such that a simple distance in the target space approximates the semantic distance in the input space.
74. Similarity Metrics. We implemented the Levenshtein distance (edit distance): the number of insertions, deletions, and substitutions needed to transform one string into another. We also used SimMetrics, an open-source Java library of 23 string similarity metrics, developed at the University of Sheffield (Chapman, 2004).
75. Similarity Metrics. We used only 10 string similarities from SimMetrics: Cosine Similarity; Block Distance; Dice Similarity; Euclidean Distance; Jaccard Similarity; Jaro Similarity; Jaro-Winkler Similarity; Matching Coefficient; Monge-Elkan Similarity.
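The path selection and renaming steps of slides 67 to 70 can be sketched as follows. This is a minimal illustration, not the project's implementation: the edge list is written by hand to match the paths shown on slide 69 (a real system would take it from the parser), and the names `EDGES`, `PROTEINS`, `build_graph`, `shortest_path`, and `anonymize` are hypothetical.

```python
from collections import deque

# Hand-written (governor, relation, dependent) edges matching slide 69.
# Assumption: a real pipeline would obtain these from a dependency parser.
EDGES = [
    ("interacts", "nsubj", "KaiC"),
    ("interacts", "prep_with", "SasA"),
    ("SasA", "conj_and", "KaiA"),
    ("SasA", "conj_and", "KaiB"),
]
PROTEINS = {"KaiC", "KaiA", "KaiB", "SasA"}


def build_graph(edges):
    """Build an undirected adjacency list: word -> [(relation, neighbour)]."""
    graph = {}
    for head, rel, dep in edges:
        graph.setdefault(head, []).append((rel, dep))
        graph.setdefault(dep, []).append((rel, head))
    return graph


def shortest_path(graph, start, goal):
    """BFS over equally weighted edges.

    Returns [word, relation, word, ...] for the first (shortest) path found,
    or None when the two words are not connected.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [rel, nxt])
    return None


def anonymize(path, pair, proteins):
    """Rename the pair to PROTX1/PROTX2 and any other protein to PROTX0."""
    first, second = pair
    renamed = []
    for token in path:
        if token == first:
            renamed.append("PROTX1")
        elif token == second:
            renamed.append("PROTX2")
        elif token in proteins:
            renamed.append("PROTX0")
        else:
            renamed.append(token)
    return " - ".join(renamed)


graph = build_graph(EDGES)
path = shortest_path(graph, "KaiC", "KaiA")
print(anonymize(path, ("KaiC", "KaiA"), PROTEINS))
```

Because BFS expands paths in order of length, the first path that reaches the goal is a shortest one, which is why no edge weights are needed here.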
78. K-Nearest Neighbor Classifier. k-nearest neighbor: assign a label according to the majority label of the k nearest training patterns.
79. KNN Example. If k = 3, it is classified as a triangle; if k = 5, it is classified as a square.
80. KNN Strengths and Weaknesses. Strengths: simple to implement and use; comprehensible, with predictions that are easy to explain; robust to noisy data by averaging over the k nearest neighbors.
81. KNN Strengths and Weaknesses. Weaknesses: needs a lot of space to store all examples; takes more time to classify a new example than a model-based classifier, since it must calculate and compare the distance from the new example to all other examples.
83. Evaluation of PPI
84. Evaluation of PPI. We used five different datasets: BioInfer, AIMed, LLL, IEPA, and HPRD50. We used the KNN classifier, varying k and the similarity metric as parameters.
85. Confusion Matrix
86. Evaluation Metrics: Precision; Recall; F-measure.
88-92. Results
93. Results and Comparison
Dataset     Min. Result   Max. Result
BioInfer    32            56.9
AIMed       5             48.9
LLL         48.8          73
IEPA        36.6          72
HPRD50      12.9          63.49
94. Our PPI System vs. Graph Kernel Approach
Dataset     Our System (%)   Graph Kernel Approach (%)
BioInfer    56.9             52.9
AIMed       48.9             56.4
LLL         73               76.8
IEPA        72               75.1
HPRD50      67               63.4
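The classification and evaluation steps of slides 74 to 86 can be sketched as below, under stated assumptions. The training paths and their labels are made up for illustration, the edit distance is computed over path tokens rather than characters, and the metric formulas are the standard definitions of precision, recall, and F1. The function names are hypothetical and this is not the SimMetrics API.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        prev = curr
    return prev[-1]


def knn_predict(train, query, k=3, distance=levenshtein):
    """Return the majority label among the k training items nearest to query."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def precision_recall_f1(gold, predicted):
    """Standard precision, recall and F1 for boolean labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical training set: (anonymized path tokens, interaction label).
TRAIN = [
    ("PROTX1 nsubj interacts prep_with PROTX2".split(), True),
    ("PROTX1 nsubj binds dobj PROTX2".split(), True),
    ("PROTX1 nsubj interacts prep_with PROTX0 conj_and PROTX2".split(), True),
    ("PROTX1 conj_and PROTX2".split(), False),
    ("PROTX1 conj_and PROTX0 conj_and PROTX2".split(), False),
]

query = "PROTX1 nsubj associates prep_with PROTX2".split()
print(knn_predict(TRAIN, query, k=3))
print(precision_recall_f1([True, True, False, False], [True, False, True, False]))
```

Passing `distance` as a parameter mirrors the evaluation setup of slide 84, where both k and the similarity metric are varied.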
97. Motivation. Goal: grouping genes according to some features. Challenges: the large number of genes; the complexity of biological networks.
98. Motivation. The solution is gene clustering.
99. Gene Clustering Techniques, based on gene expression. Advantages: high accuracy. Disadvantages: high cost; time consuming; noise.
100. Gene Clustering Techniques, based on text mining. Advantages: low cost; low time consumption. Disadvantages: low accuracy.
101. Gene Clustering Based on Text Mining. To perform gene clustering we need clustering algorithms and similarity measurements.
102. Clustering Algorithms: hierarchical algorithms; partitioning algorithms; density-based algorithms.
103. Hierarchical Algorithms: Single Linkage
104. Partitioning Algorithms: K-Medoids
105. Density-Based Algorithms: DBSCAN
106. Graph-Theoretic Algorithms: Zahn's Algorithm
107. Similarity Measurements: Swanson's Algorithm; Document Occurrences.
108. Swanson's Algorithm. Search PubMed for gene A and extract set A (the most related keywords, MeSH or GO terms). Search PubMed for gene B and extract set B (the most related keywords, MeSH or GO terms). Based on the intersection between set A and set B, we apply the cosine similarity.
109. Document Occurrences. Search PubMed for gene A and extract set A (the document IDs of gene A). Search PubMed for gene B and extract set B (the document IDs of gene B). Based on the intersection between set A and set B, we apply the Jaccard similarity coefficient.
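The two similarity measurements of slides 108 and 109 can be sketched as set operations. This is a minimal sketch: it treats each set as a binary vector over the union of both sets, as in the worked example of slides 51 and 52, the term sets are the ones from slide 51, and the document IDs are made up. The function names are hypothetical.

```python
import math


def cosine_similarity(terms_a, terms_b):
    """Cosine of the angle between two binary term vectors.

    For binary vectors the dot product is the size of the intersection,
    and each norm is the square root of the set size.
    """
    a, b = set(terms_a), set(terms_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def jaccard_similarity(ids_a, ids_b):
    """Jaccard coefficient: intersection size over union size."""
    a, b = set(ids_a), set(ids_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Term sets from slide 51.
related_to_a = ["A", "B", "C", "D", "E", "F", "G", "H"]
related_to_b = ["A", "X", "Y", "B", "Z", "D", "E", "F"]
print(cosine_similarity(related_to_a, related_to_b))

# Hypothetical PubMed document IDs for two genes.
print(jaccard_similarity({1, 2, 3, 4}, {3, 4, 5}))
```

On the slide 51 term sets the cosine value agrees with the 5/8 worked out on slide 52, since both sets have eight terms and share five.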
111. Extended Work: PPI System with SVM Classifier (1). Equation: $u = w \cdot x - b$. Objective: $\min_{w,b} \frac{1}{2}\lVert w \rVert^2$ subject to $y_i (w \cdot x_i - b) \ge 1,\ \forall i$.
112. Extended Work: PPI System with SVM Classifier (2). $\min_{\alpha} \Psi(\alpha) = \min_{\alpha} \frac{1}{2} \sum_i \sum_j y_i y_j (x_i \cdot x_j)\, \alpha_i \alpha_j - \sum_i \alpha_i$. Here $\alpha_i$ is called the multiplier, and if we can get $\alpha$ we can get $(w, b)$: $w = \sum_i y_i \alpha_i x_i$, $b = w \cdot x_k - y_k$ for some $\alpha_k > 0$.
114. Conclusion. Problem 1: algorithms for concept recognition in document abstracts and titles. We introduced an algorithm to annotate the Gene Ontology terms in the documents. Problem 2: use the annotated documents to build a structured representation of documents. We showed how the framework uses the Gene Ontology to build a semantic representation of the obtained documents. Problem 3: design a system for ontology-based search engines for biological researchers. We introduced the design of the framework and showed how it is flexible for future modifications and scalable with respect to the number of documents and the number of users.
115.
Conclusion. Problem 4: using Swanson's algorithm to assess the similarity between different biological terms. We showed how Swanson's algorithm can be used to estimate the similarity between two instances (P53 and P21). Problem 5: supervised machine learning algorithms for prediction of protein-to-protein interactions. We showed how we used supervised machine learning algorithms such as KNN, together with a new technique for estimating the distance between sentences, to predict the possible interactions between proteins mentioned in the documents. Problem 6: unsupervised machine learning algorithms to identify different clusters of genes. We showed how we used unsupervised machine learning algorithms such as DBSCAN, with similarity based on Swanson's algorithm and cosine similarity, to group genes mentioned in the documents into different clusters.
116. Future work. There are hot research areas and open problems in biological text mining. Content providers for documents: Google Scholar; using Semantic Web 3.0 (online journals). Ontology generation: the ability to edit the ontologies and add knowledge. Other ontologies: using Wikipedia as an ontology.
117. Future work. There are some features that may be added to the system. Biomedical ontology-based search engine: provide a document summary for each group of documents; allow the user to save and print the results obtained by the system. Protein-Protein Interaction (PPI): use more sophisticated classifiers and machine learning techniques, such as AdaBoost, to enhance the classification process. Use background knowledge of verbs, as many verbs have the same meaning. This will help the system produce more accurate results, since we can introduce a fuzzy distance for the differences between the meanings of verbs. It will also make it possible to discover the type of relation between the terms, giving more semantic relation identification.
118.
Future work. There are some features that may be added to the system. Gene clustering: use more sophisticated clustering algorithms that were originally designed for gene clustering. More applications: based on the services provided by the ontology-based engine, we can construct applications such as extracting the relations between drugs and diseases, or grouping diseases into different clusters, which helps to identify the characteristics of a newly discovered disease, as well as other applications that rely on text mining in biomedical documents.