NLP pipeline for protein mutation knowledgebase construction
Jonas B. Laurila, Nona Naderi, René Witte, Christopher J.O. Baker
Background
• Knowledge about mutations is crucial for many applications, e.g. protein engineering and biomedicine.
• Protein mutations are described in scientific literature.
• The amount of information grows faster than manual database curation can handle.
• Automatic reuse of mutation impact information from documents is needed.
Example excerpts
"Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses terminally chlorinated and brominated n-alkanes to the corresponding alcohols."
"The W125F mutant showed only a slight reduction of activity (Vmax) and a larger increase of Km with 1,2-dibromoethane."
• Directionality of impact
• Protein property
• Mutation
• Protein name
• Gene name
• Organism name
Mutation impact ontology
NLP framework
Named entity recognition
• Protein, gene and organism names
  – Gazetteer lists based on SwissProt
  – Mappings encoded in the MGDB
• Mutation mentions
  – MutationFinder: ~700 regular expressions
  – Normalized into wNm format
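As a rough illustration of regular-expression-based mutation detection and wNm normalization (wild-type residue, position, mutant residue), here is a minimal Python sketch. It uses only two hand-written patterns and is not MutationFinder's rule set; the names `PATTERNS` and `find_mutations` are assumptions made for this example.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE = "ACDEFGHIKLMNPQRSTVWY"
THREE = "|".join(THREE_TO_ONE)

# Two illustrative patterns only (an assumption of this sketch).
PATTERNS = [
    # One-letter form, e.g. "W125F"
    re.compile(rf"\b(?P<w>[{ONE}])(?P<n>[1-9]\d*)(?P<m>[{ONE}])\b"),
    # Three-letter form, e.g. "Trp125Phe", "Trp-125-Phe" or "Trp125->Phe"
    re.compile(rf"\b(?P<w>{THREE})-?(?P<n>[1-9]\d*)(?:-|->)?(?P<m>{THREE})\b"),
]


def find_mutations(text):
    """Return mutation mentions normalized to wNm strings, in text order."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            # Map three-letter codes to one letter; one-letter codes pass through.
            w = THREE_TO_ONE.get(match.group("w"), match.group("w"))
            m = THREE_TO_ONE.get(match.group("m"), match.group("m"))
            hits.append((match.start(), f"{w}{match.group('n')}{m}"))
    return [mutation for _, mutation in sorted(hits)]
```

Applied to the second example excerpt, `find_mutations` would pick up the mention `W125F`; a production system needs many more patterns and filters for false positives such as cell-line or plasmid names.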
Named entity recognition
Protein properties
1. Protein functions
  – Noun phrases extracted with MuNPEx
  – Activity, binding, affinity, specificity as head nouns
2. Kinetic variables
  – JAPE rules to extract Km, kcat and Km/kcat in current implementation
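The two kinds of protein property mentions can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the MuNPEx or JAPE implementation: noun phrases are assumed to be already chunked, the head noun is taken to be the last token of the phrase, and the function names are hypothetical.

```python
import re

# Head nouns that mark a noun phrase as a protein function mention.
HEAD_NOUNS = {"activity", "binding", "affinity", "specificity"}

# Longest alternative first so "Km/kcat" is not split into two mentions.
KINETIC = re.compile(r"\b(Km/kcat|kcat|Km)\b")


def is_protein_function(noun_phrase):
    """True if the last token (assumed head noun) is one of HEAD_NOUNS."""
    tokens = noun_phrase.lower().split()
    if not tokens:
        return False
    return tokens[-1].strip(".,;:()") in HEAD_NOUNS


def kinetic_variables(text):
    """Return kinetic variable mentions in order of appearance."""
    return KINETIC.findall(text)
```

Taking the last token as head noun is a common heuristic for English noun phrases; it fails for phrases with post-modifiers ("activity of the enzyme"), which a real chunker would handle.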
Mutation grounding
Linking mutations positionally correctly to the target sequence
• Important for reuse of mutation mentions
• Levels of grounding: three levels, shown as figures on the original slide
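One way to picture positional grounding: a mutation wNm is consistent with a candidate sequence if residue w occurs at position N, possibly after shifting the numbering by a constant offset (for example when the paper numbers residues without a signal peptide). The Python sketch below illustrates that idea only; the offset search and the names `matches` and `ground` are assumptions, not the algorithm described on the slide.

```python
import re

MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse(mutation):
    """Split a wNm string into (wild-type residue, position, mutant residue)."""
    w, n, m = MUTATION.match(mutation).groups()
    return w, int(n), m


def matches(sequence, mutations, offset=0):
    """True if every wild-type residue sits at position + offset (1-based)."""
    for w, n, _ in map(parse, mutations):
        index = n + offset - 1
        if not (0 <= index < len(sequence)) or sequence[index] != w:
            return False
    return True


def ground(sequence, mutations, max_offset=30):
    """Return the smallest-magnitude offset grounding all mutations, or None."""
    for distance in range(max_offset + 1):
        for offset in ((0,) if distance == 0 else (distance, -distance)):
            if matches(sequence, mutations, offset):
                return offset
    return None
```

Requiring all mutations from one document to agree on a single offset makes accidental matches less likely than checking each mutation on its own.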
mSTRAPviz
Structure annotation visualization
Mutations extracted from text are visualized on the protein structure; mutation grounding is a prerequisite for this.
Protein function grounding
• Mentions of protein functions are linked to correct Gene Ontology concepts.
• Previously grounded proteins and mutations provide us with hints.
• Grounding scored based on string similarity (later used during impact extraction)
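Scoring a function mention against candidate concept names by string similarity could look like the sketch below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the slide does not say which measure is used. The candidate table, its placeholder identifiers and the threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical candidates; the keys are placeholders, not real GO identifiers.
CANDIDATES = {
    "GO:EXAMPLE1": "haloalkane dehalogenase activity",
    "GO:EXAMPLE2": "hydrolase activity",
    "GO:EXAMPLE3": "protein binding",
}


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ground_function(mention, candidates=CANDIDATES, threshold=0.5):
    """Return (concept id or None, best score) for a function mention."""
    scored = [(similarity(mention, name), concept_id)
              for concept_id, name in candidates.items()]
    score, concept_id = max(scored)
    return (concept_id, score) if score >= threshold else (None, score)
```

In a fuller pipeline the candidate list would be narrowed first, using the previously grounded protein as a hint, and the score would be carried along for use during impact extraction.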
Relation detection
• Impacts
  – Words describing directionality + protein properties
• Mutants
  – Set of mutations giving rise to altered proteins
• Mutant–Impact relations
  – The causal relation between mutants and their impacts
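The three relation types can be illustrated with a deliberately simple sentence-level heuristic in Python: each directionality word is paired with the nearest following property mention to form an impact, and every mutation in the same sentence is linked to those impacts. The word list, the pairing rule and the function name are assumptions of this sketch, not the rules used in the pipeline.

```python
import re

MUTATION = re.compile(r"\b[ACDEFGHIKLMNPQRSTVWY][1-9]\d*[ACDEFGHIKLMNPQRSTVWY]\b")
PROPERTY = re.compile(r"\b(?:activity|binding|affinity|specificity|Km/kcat|kcat|Km)\b")

# Small illustrative lexicon of directionality words (an assumption).
DIRECTION = {
    "increase": "positive", "increased": "positive", "higher": "positive",
    "reduction": "negative", "reduced": "negative",
    "decrease": "negative", "decreased": "negative", "lower": "negative",
}


def extract_relations(sentence):
    """Return (mutation, direction, property) triples found in one sentence."""
    mutations = MUTATION.findall(sentence)
    properties = [(m.start(), m.group()) for m in PROPERTY.finditer(sentence)]
    impacts = []
    for word in re.finditer(r"[A-Za-z]+", sentence):
        direction = DIRECTION.get(word.group().lower())
        if direction is None:
            continue
        # Pair the directionality word with the nearest property after it.
        following = [name for start, name in properties if start >= word.end()]
        if following:
            impacts.append((direction, following[0]))
    return [(mutation, direction, prop)
            for mutation in mutations
            for direction, prop in impacts]
```

On the W125F excerpt this heuristic yields a negative impact on activity and a positive (increasing) impact on Km. It ignores negation and hedging ("only a slight reduction"), which a real system would need to treat.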
OwlExporter
• Translates GATE Annotations to OWL instances
• Application independent
• Literature specifications added automatically
• Used here to populate our Mutation impact ontology to create a mutation knowledgebase
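To make the annotation-to-instance translation concrete, the sketch below serializes simple annotation records as OWL individuals in Turtle syntax using plain string formatting. It is not the OwlExporter API: the record layout, the `mio:` prefix, the base IRI and the class and property names are hypothetical.

```python
def to_turtle(annotations, base="http://example.org/mutation-impact#"):
    """Serialize annotation dicts as OWL named individuals in Turtle.

    Each annotation is a dict with 'id', 'type' and optional 'features'
    (a hypothetical layout chosen for this sketch).
    """
    lines = [
        f"@prefix mio: <{base}> .",
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "",
    ]
    for ann in annotations:
        lines.append(
            f"mio:{ann['id']} rdf:type owl:NamedIndividual , mio:{ann['type']} ;"
        )
        for name, value in sorted(ann.get("features", {}).items()):
            # Escape backslashes and quotes so the literal stays valid Turtle.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'    mio:{name} "{escaped}" ;')
        # Close the statement: turn the trailing ';' into '.'.
        lines[-1] = lines[-1][:-1] + "."
    return "\n".join(lines)
```

For example, `to_turtle([{"id": "m1", "type": "Mutation", "features": {"normalizedForm": "W125F"}}])` produces one individual typed as `mio:Mutation` with a single string-valued property.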
Example query
Retrieve mutations that do not have an impact on haloalkane dehalogenase activity (also retrieve the SwissProt identifier of the protein being mutated).
Example query
Retrieve mutations on haloalkane dehalogenase that do not negatively impact the Michaelis constant.
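The logic of such a query can be mimicked in Python over a flat list of impact records. This is only an in-memory analogue for illustration: the `ImpactRecord` fields, the direction labels and the function name are hypothetical, and any accession values passed in are up to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactRecord:
    mutation: str    # normalized wNm string
    protein: str     # grounded protein name
    accession: str   # SwissProt identifier of the mutated protein
    prop: str        # affected property, e.g. "Km"
    direction: str   # "positive", "negative" or "neutral"


def mutations_without_negative_impact(records, protein, prop):
    """Return sorted (mutation, accession) pairs with no negative impact on prop."""
    by_mutation = {}
    for r in records:
        if r.protein.lower() == protein.lower() and r.prop == prop:
            key = (r.mutation, r.accession)
            by_mutation.setdefault(key, set()).add(r.direction)
    return sorted(key for key, directions in by_mutation.items()
                  if "negative" not in directions)
```

Grouping by mutation before filtering matters: a mutation reported with both a positive and a negative impact on the same property should be excluded, which a per-record filter would miss.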
Evaluation
Mutation grounding performance
What’s next?
• Modularize into a set of web services
• Database (re-)creation
• Reuse in phenotype prediction algorithms (SNAP)*
*Bromberg and Rost, 2007
NLP pipeline for protein mutation knowledgebase construction
Jonas B. Laurila, CSAS, UNB, Saint John
Nona Naderi, CSE, Concordia University, Montréal
René Witte, CSE, Concordia University, Montréal
Christopher J.O. Baker, CSAS, UNB, Saint John
Acknowledgement
This research was funded in part by:
• New Brunswick Innovation Foundation, New Brunswick, Canada
• NSERC, Discovery Grant, Canada
• Quebec-New Brunswick University Co-operation in Advanced Education - Research Program, Government of New Brunswick, Canada