![Page 1: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/1.jpg)
Hong Cui, University of Arizona
Phenotype RCN Feb 25-27, 2013
SUB-LANGUAGE PROCESSING FOR PHENOTYPE CURATION
![Page 2: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/2.jpg)
Agenda
• CharaParser
  • Methodology
  • Evaluation
  • Applications
• CharaParser for Phenoscape
  • New modules
  • Evaluations
  • Challenges
![Page 3: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/3.jpg)
“Fine-Grained Semantic Mark-up”
• To annotate factual information from textual morphological descriptions of biodiversity in such detail that the machine-readable annotation itself provides information equivalent to the original text.
![Page 4: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/4.jpg)
An Example
![Page 5: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/5.jpg)
![Page 6: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/6.jpg)
Previous Research
• Syntactic parsing approach (Taylor, 1995; Abascal & Sánchez, 1999; Vanel, 2004)
• Interactive extraction (Diederich, Fortuner & Milton, 1999)
• Semi-supervised bootstrapping for lexicons (Riloff, 1999)
• Supervised regular expression rule learning (Soderland, 1999; Tang & Heidorn, 2008)
• Ontology-driven and parallel text (Woods et al., 2004)
• Supervised association rule learning (Cui & Heidorn, 2007)
![Page 7: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/7.jpg)
![Page 8: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/8.jpg)
![Page 9: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/9.jpg)
![Page 10: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/10.jpg)
General-Purpose Parsers?
![Page 11: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/11.jpg)
CharaParser Approach
1. Unsupervised machine learning to find anatomy and character terms from descriptions automatically
  • No need to prepare training examples
  • 50%–80% of terms learned
2. General-purpose syntactic parser (e.g., Stanford Parser) to parse the syntactic structure of sentences
  • No need to create a special-purpose, domain-dependent parser
  • The lexicon learned in step 1 is used to adapt the parser to biodiversity domains
3. Intuitive rules to produce annotations from parse trees
![Page 12: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/12.jpg)
Unsupervised lexicon learning
If it is known that “roots” is an organ:
• Roots yellow to medium brown or black, thin.
• Petals yellow or white
• Petals absent;
• Subtending bracts absent;
• Abaxial hastula absent;
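The bootstrapping idea behind these example clauses can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not CharaParser's actual algorithm: it assumes every clause starts with an organ term, that the word right after a known organ is a character state, and that the words in front of a known state form an organ name. The function name `bootstrap_lexicon` and the `max_rounds` parameter are hypothetical.

```python
import re

def bootstrap_lexicon(clauses, seed_organs, max_rounds=10):
    """Grow organ and state lexicons from a seed list of organ terms."""
    organs = {o.lower() for o in seed_organs}
    states = set()
    tokenized = [re.findall(r"[a-z]+", c.lower()) for c in clauses]
    for _ in range(max_rounds):
        changed = False
        for words in tokenized:
            for i in range(1, len(words)):
                subject = " ".join(words[:i])   # clause-initial words
                nxt = words[i]                  # the word that follows them
                if subject in organs and nxt not in states and nxt not in organs:
                    # Known organ at clause start: the next word is taken as a state.
                    states.add(nxt)
                    changed = True
                elif (nxt in states and subject not in organs
                      and not set(words[:i]) & states):
                    # Known state after unknown clause-initial words: those words
                    # are taken as an organ name.
                    organs.add(subject)
                    changed = True
        if not changed:
            break
    return organs, states

clauses = [
    "Roots yellow to medium brown or black, thin.",
    "Petals yellow or white",
    "Petals absent;",
    "Subtending bracts absent;",
    "Abaxial hastula absent;",
]
organs, states = bootstrap_lexicon(clauses, ["roots"])
```

Starting from the single seed "roots", the sketch learns "yellow" as a state, then "petals" as an organ, then "absent" as a state, and finally "subtending bracts" and "abaxial hastula" as organs. A real system would need much more careful handling of function words and modifiers.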
![Page 13: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/13.jpg)
CharaParser: Term Reviewer
![Page 14: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/14.jpg)
Ontology Term Organizer
![Page 15: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/15.jpg)
Compared against a Heuristics-Based Method
• Parser performance evaluated on the same data sets.
• CharaParser: unsupervised learning + Stanford Parser
• Heuristics-based: unsupervised learning + regular expression rules
![Page 16: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/16.jpg)
Annotation Problems
• Chunk errors:
  • Leaves oblanceolate to lanceolate, largest 14–20(–40) × 3–4(–5) mm, pliant;
• Attachment errors:
  • on outer cypselae, crowns of bristlelike scales ca. 0.5 mm; on inner, of dusky white or pale yellow, plumose bristles 5–6 mm.
• Semantics:
  • straight posterolateral bounding ridges to subtriangular, bilobed ventral muscle field;
![Page 17: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/17.jpg)
Applications at Various Development Stages
• Convert XML markup to
  • SDD for identification key generation
  • Character matrices for tree of life
  • RDF for the Semantic Web and search
• Use marked-up descriptions to support search
  • FNA Experimental Search
    • Data source is RDF triples
    • Allow character-based search
    • Plants that give yellow flowers at 200-400 meter elevation in April in North Carolina
![Page 18: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/18.jpg)
![Page 19: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/19.jpg)
To-Dos
• Tighter integration of ontologies in the annotation process.
  • Currently internal glossaries are used in place of ontologies to link a character state (e.g., “red”) to a character (“color”)
  • Synonyms are not controlled
    • “Petiolate” = “with petiole”
• Continue to reduce annotation errors
• Accommodate various syntactic styles
  • Diagnosis paragraphs
  • Comparison among different taxa
![Page 20: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/20.jpg)
Phenotype Curation
• Convert character and character state information from natural language descriptions to EQ statements
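To make the target representation concrete, here is a minimal Python sketch of an EQ (Entity-Quality) statement and of the step from a raw EQ (free-text phrases) to a final EQ (ontology terms). The class layout, the `ontologize` helper, and the `EX:` identifiers are hypothetical and for illustration only; they are not real ontology IDs or the project's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class EQ:
    entity: str                           # anatomical entity (phrase or ontology ID)
    quality: str                          # quality (phrase or ontology ID)
    related_entity: Optional[str] = None  # second entity for relational qualities

def ontologize(raw, term_ids):
    """Replace each phrase with an ontology ID when one is known; keep the phrase otherwise."""
    def lookup(phrase):
        return None if phrase is None else term_ids.get(phrase, phrase)
    return EQ(lookup(raw.entity), lookup(raw.quality), lookup(raw.related_entity))

# Hypothetical phrase-to-ID table; the IDs are placeholders, not real ontology terms.
TERM_IDS = {"petal": "EX:0000001", "yellow": "EX:0000002"}

raw = EQ(entity="petal", quality="yellow")
final = ontologize(raw, TERM_IDS)
unresolved = ontologize(EQ(entity="petal", quality="crinkly"), TERM_IDS)
```

In this sketch a phrase with no matching term simply stays as text, which leaves the gap visible for a curator to resolve.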
![Page 21: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/21.jpg)
Curator Mental Process
Read description
Identify key phrases (raw EQ)
ontologized EQ
ontologies
![Page 22: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/22.jpg)
Adapted CharaParser
Character Description
State Descriptions
CharaParser
XML to Raw EQs
Raw EQs to Final EQs
Ontologies
![Page 23: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/23.jpg)
Evaluations
• Internal evaluation:
  • The development corpus (three publications on fishes and archosaurs) provided 1,200 character descriptions; 100 of them were included in the internal evaluation benchmark.
  • Raw EQ performance: 90%
  • Final EQ performance: 50%
• BioCreative 2012 evaluation:
  • 50 descriptions independently selected by the organizer (>50% of Qs were not in ontologies)
  • Gold standard created by the chief Phenoscape curator (raw and final)
  • Three biocurators worked in two modes (Phenex vs. Phenex+CharaParser)
  • Raw EQ performance: CharaParser better than biocurators
  • Final EQ performance: biocurators better than CharaParser
  • Inter-curator agreements:
![Page 24: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/24.jpg)
Inter-Curator Agreements

| | Precision | Recall |
|---|---|---|
| Curator 1 vs 2 | 39 | 49 |
| Curator 1 vs 3 | 47 | 56 |
| Curator 2 vs 3 | 77 | 71 |
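One way to compute pairwise agreement figures of this kind is to treat one curator's EQ statements as the reference and score the other curator's statements against them. The sketch below assumes exact matching of whole statements; the slides do not say how matches were actually scored, and the function name and toy annotations are hypothetical.

```python
def precision_recall(candidate, reference):
    """Score one annotation set against another using exact statement matches."""
    candidate, reference = set(candidate), set(reference)
    if not candidate or not reference:
        return 0.0, 0.0
    matched = len(candidate & reference)
    return matched / len(candidate), matched / len(reference)

# Toy (entity, quality) annotations from two hypothetical curators.
curator_a = {("fin", "absent"), ("rib", "curved"), ("skull", "wide")}
curator_b = {("fin", "absent"), ("rib", "straight")}

precision, recall = precision_recall(curator_b, curator_a)
```

Because the two sets can differ in size, precision and recall are generally not equal, which is consistent with the asymmetric values in the table.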
![Page 25: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/25.jpg)
Error Analyses
• Various fixable syntactic problems
  • E.g., “digits I-III”
• Curation granularity
  • CharaParser generated more candidate EQs than curators
  • “Preopercular latero-sensory canal leaves preopercle at first exit and enters a plate: yes/no”
• Annotating relations (relational quality)
  • “contact between …”
![Page 26: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/26.jpg)
Ontology Access
• Currently use keyword-based search
  • Class labels and exact, narrower, and related synonyms
• False positives
  • acute (shape) =? acute (process)
  • "margin" is a broad synonym of "marginal zone of embryo" in UBERON
• Pre-composed terms in ontology
  • “ceratobranchial 5 tooth”, “rib of vertebra 5”, “body of humerus”
• Ambiguous term use in descriptions
  • ‘epibranchial 1’ => epibranchial 1 element? bone? cartilage?
• No matching
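The false-positive problem with keyword lookup can be illustrated with a small Python sketch. The toy class list, its `EX:` identifiers, and the `keyword_search` function are hypothetical; only the "margin" and "acute" examples come from the slide. The default synonym scopes follow the slide (exact, narrower, related), and the scopes are kept as a parameter so the effect of adding broad synonyms can be seen.

```python
# Toy ontology; IDs are placeholders, not real UBERON or PATO identifiers.
ONTOLOGY = [
    {"id": "EX:0000001", "label": "marginal zone of embryo",
     "synonyms": {"broad": ["margin"]}},
    {"id": "EX:0000002", "label": "acute", "synonyms": {}},   # a shape
    {"id": "EX:0000003", "label": "acute", "synonyms": {}},   # a process
]

def keyword_search(term, classes, scopes=("exact", "narrow", "related")):
    """Return IDs of classes whose label or selected synonyms equal the term."""
    term = term.strip().lower()
    hits = []
    for cls in classes:
        names = [cls["label"]]
        for scope in scopes:
            names.extend(cls["synonyms"].get(scope, []))
        if term in (name.lower() for name in names):
            hits.append(cls["id"])
    return hits

ambiguous = keyword_search("acute", ONTOLOGY)
strict = keyword_search("margin", ONTOLOGY)
loose = keyword_search("margin", ONTOLOGY, scopes=("exact", "narrow", "related", "broad"))
```

A string match alone cannot tell the two "acute" classes apart, and admitting broad synonyms makes "margin" resolve to an embryonic structure, so some additional context is needed to pick the right class.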
![Page 27: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/27.jpg)
Exploration of Solutions
• Experimented with
  • Word sense disambiguation:
    • “crinkly” not in PATO
    • Candidate matches: [undulate->1.00000000000002] [obovate->1.00000000000001] [flat->1.00000000000001] [flattened->1] [circinate->0.884697579551583]
• Experimenting with
  • Subsets
    • Specify included classes: e.g., classes related to vertebrates
    • Specify excluded classes: e.g., exclude certain developmental stages
• Ideas to try out:
  • Bootstrapping to narrow down the search space
    • starting from known classes
    • evaluating candidate matches based on their distances to the known classes and other sources of evidence.
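The subset idea (included and excluded classes) can be sketched as a filter over candidate matches. This is a minimal illustration assuming a single-parent hierarchy stored as a child-to-parent dictionary; the class names, the `PARENTS` table, and the `in_subset` function are hypothetical and do not reflect how the ontologies are actually accessed.

```python
# Toy single-parent hierarchy (child -> parent); names are illustrative only.
PARENTS = {
    "rib": "vertebrate structure",
    "humerus": "vertebrate structure",
    "marginal zone of embryo": "embryonic structure",
}

def lineage(cls, parents):
    """Return the class together with all of its ancestors."""
    seen = {cls}
    while cls in parents and parents[cls] not in seen:
        cls = parents[cls]
        seen.add(cls)
    return seen

def in_subset(cls, parents, included=None, excluded=()):
    """Keep a class unless it falls under an excluded class or outside the included ones."""
    line = lineage(cls, parents)
    if line & set(excluded):
        return False
    return included is None or bool(line & set(included))

candidates = ["rib", "marginal zone of embryo", "humerus"]
kept = [c for c in candidates
        if in_subset(c, PARENTS, included=["vertebrate structure"])]
no_embryo = [c for c in candidates
             if in_subset(c, PARENTS, excluded=["embryonic structure"])]
```

Filtering candidates this way shrinks the search space before any scoring is done, which is also the motivation for the bootstrapping idea listed above.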
![Page 28: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/28.jpg)
Annotation consistency
• Instructions given to human curators are helpful to CharaParser
  • Restricted relation list:
    • http://phenoscape.org/wiki/Guide_to_Character_Annotation#Relations_used_for_post-compositions
![Page 29: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/29.jpg)
Feed more info to EQ generation module
Character Description
State Descriptions
CharaParser
XML to EQs
Raw EQs to Final EQs
Ontologies
![Page 30: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/30.jpg)
Recent Improvements
• Explorer of Taxon Concepts project
  • Making it a pure-Java program/web-based application
    • Currently requires MySQL + Perl
  • Making it faster
    • Optimization of the program
    • Removing MySQL and reducing I/O
    • “Parallel” computing using Java threads
• Preliminary evaluation shows
  • 20 times faster: 2 sec/taxon description
  • Memory requirements increased threefold
![Page 31: Sub-language Processing for phenotype curation](https://reader035.vdocument.in/reader035/viewer/2022062411/568166bb550346895ddac32a/html5/thumbnails/31.jpg)
Acknowledgements
• Fine-Grained Semantic Markup Project (current and past)
  • James Macklin: Agriculture and Agri-Food Canada
  • Robert (Bob) Morris, Alex Dusenbery: UMass-Boston
  • Hariharan Gopalakrishnan, Zilong Chang, Thomas Rodenhausen, Mohan Krishna Gowda, Partha Pratim Sanyal, Chunshui Yu: University of Arizona
• Phenoscape Project
  • Chris Mungall: Lawrence Berkeley National Lab
  • Melissa Haendel: Oregon Health & Science University
  • Paula Mabee, Alex Dececchi: University of South Dakota
  • Jim Balhoff, Wasila Dahdul, Hilmar Lapp, Todd Vision: NESCent
• NSF ABI and EF Programs
• The Flora of North America Project