BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai
Department of Computer Science
Institute for Genomic Biology
Statistics
Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 22, 2009
Goal of Informatics Research
• Develop general and scalable computational methods to enable
– Semantic integration of data and information
– Effective information access and exploration
– Knowledge discovery
– Hypothesis formulation and testing
• Reinforcement of research in biology and computer science
– CS research to automate manual tasks of biologists
– Biology research to raise new challenges for CS
Overview of BeeSpace Technology
[Figure: BeeSpace architecture diagram. Components: Literature Text; Search Engine; Natural Language Understanding (Words/Phrases, Entities, Relations); Text Miner; Meta Data; Relational Database; Function Annotator; Gene Summarizer; Space/Region Manager, Navigation Support; Users. Layers: Content Analysis; Information Access & Exploration; Knowledge Discovery & Hypothesis Testing; Question Answering.]
Informatics Research Accomplishments
[Figure: BeeSpace technology overview diagram, annotated with the accomplishments listed below]
• Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]
• Entity/Relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b]
• Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08]
• Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]
• Automatic Function Annotation [He et al. 09/10]
Overview of BeeSpace Technology
[Figure: BeeSpace technology overview diagram, with regions marked for the four parts of this talk]
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
Part 1. Information Extraction
Natural Language Understanding
…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …
[Figure: the sentence above annotated with syntactic chunks (NP, VP) and two Gene entity labels]
Entity & Relation Extraction
Genetic Interaction

| Gene X | Gene Y |
|---|---|
| Bcd | hb |
| … | … |

Expression Location

| Gene X | Anatomy Y |
|---|---|
| Bcd | embryo |
| Hb | egg |
| … | … |

Lopes FJ et al., 2005 J. Theor. Biol.
General Approach: Machine Learning
• Computers learn from labeled examples to compute a function to predict labels of new examples
• Examples of predictions
– Given a phrase, predict whether it is a gene name
– Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation
• Many learning methods are available, but training data isn’t always available
Extraction Example 1: Gene Name Recognition
… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.
[Figure: three candidate phrases in the passage above are each marked “Gene?”]
Features for Recognizing Genes
• Syntactic clues:
– Capitalization (especially acronyms)
– Numbers (gene families)
– Punctuation: -, /, :, etc.
• Contextual clues:
– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times in the same article
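The clues listed above could be turned into binary features roughly as follows. This is a minimal sketch, not the BeeSpace implementation: the function name `gene_features`, the window size, and the exact feature names are assumptions made for illustration.

```python
import re

# Cue words taken from the slide's examples of local context.
CONTEXT_WORDS = {"gene", "encoding", "regulation", "expressed"}

def gene_features(tokens, i, window=2):
    """Return binary features for the candidate token at position i (hypothetical helper)."""
    word = tokens[i]
    feats = {
        # Syntactic clues
        "starts_with_capital": word[:1].isupper(),
        "is_acronym": len(word) > 1 and word.isupper(),
        "contains_digit": any(ch.isdigit() for ch in word),
        "contains_punct": bool(re.search(r"[-/:]", word)),
    }
    # Local context: cue words within `window` tokens of the candidate.
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = {t.lower() for j, t in enumerate(tokens[lo:hi], start=lo) if j != i}
    feats["near_context_word"] = bool(neighbors & CONTEXT_WORDS)
    # Global context: the same string occurs more than once in the article.
    feats["repeated_in_article"] = sum(t == word for t in tokens) > 1
    return feats

tokens = "a cDNA encoding Apis mellifera ultraspiracle ( AMUSP )".split()
features = gene_features(tokens, tokens.index("AMUSP"))
```

Feature dictionaries of this kind are what a classifier such as a maximum entropy model would consume.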
Maximum Entropy Model for Gene Tagging
• Given an observation (a token or a noun phrase), together with its context, denoted as x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model:
$P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, where $K$ normalizes the probabilities to sum to one over $y$
• Typical f:
– y = gene & candidate phrase starts with a capital letter
– y = gene & candidate phrase contains digits
• Estimate $\lambda_i$ with training data
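To make the model concrete, the following sketch evaluates P(y|x) for the two typical features named on the slide. The weights in `WEIGHTS` are hypothetical placeholders; in practice they would be estimated from training data, and the function names are illustrative only.

```python
import math

LABELS = ("gene", "non-gene")

def f_capital(x, y):
    # Fires when y = gene and the candidate starts with a capital letter.
    return 1.0 if y == "gene" and x[:1].isupper() else 0.0

def f_digit(x, y):
    # Fires when y = gene and the candidate contains a digit.
    return 1.0 if y == "gene" and any(c.isdigit() for c in x) else 0.0

FEATURES = (f_capital, f_digit)
WEIGHTS = (1.2, 0.8)  # hypothetical lambda_i values, not trained

def maxent_prob(x):
    """Return P(y|x) for each label under the exponential model."""
    scores = {
        y: math.exp(sum(lam * f(x, y) for lam, f in zip(WEIGHTS, FEATURES)))
        for y in LABELS
    }
    z = sum(scores.values())  # z = 1/K, the normalizer for this x
    return {y: s / z for y, s in scores.items()}

probs = maxent_prob("CD38")
```

With these assumed weights a capitalized candidate containing digits leans toward the gene label, while a candidate that fires no feature gets equal probability for both labels.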
Special Challenges
• Gene name disambiguation
• Domain adaptation
Gene Name Disambiguation
• Gene names can be common English words: for (foraging), in (inturned), similar (sima), yellow (y), black (b)…
• Solution:
– Disambiguate by looking at the context of the candidate word
– Train a classifier
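One simple way to picture context-based disambiguation is a linear score over neighboring words. The sketch below is an illustration under assumptions: the weights in `NEIGHBOR_WEIGHTS` are invented for the example, whereas a trained classifier would learn them from labeled occurrences.

```python
# Hypothetical weights for neighboring words; positive values suggest a gene mention.
NEIGHBOR_WEIGHTS = {
    "gene": 2.5, "mutants": 2.0, "encodes": 1.5,
    "flies": -1.0, "and": -0.5, "such": -0.5,
}

def context_score(tokens, i, window=2):
    """Sum the weights of the words within `window` positions of token i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return sum(NEIGHBOR_WEIGHTS.get(tokens[j].lower(), 0.0)
               for j in range(lo, hi) if j != i)

def is_gene_mention(tokens, i, window=2):
    # A positive score is read as "this occurrence is a gene name".
    return context_score(tokens, i, window) > 0

sent = "we demonstrate a function for the for gene in sensory responsiveness".split()
first_for = is_gene_mention(sent, 4)   # the preposition in "function for"
second_for = is_gene_mention(sent, 6)  # the gene name in "the for gene"
```

Here the second "for" is followed by "gene" and is accepted, while the first has no supporting neighbors and is rejected.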
Discriminative Neighbor Words
Sample Disambiguation Results
Candidates “foraging”, “for” (classifier score in brackets after each occurrence):
... affect complex behaviors such as locomotion and foraging [-1.468]. The foraging [+3.359] (for [+5.497]) gene encodes a pkg in drosophila melanogaster here we demonstrate a function for [-0.582] the for [+5.980] gene in sensory responsiveness and …
Candidate “black” (classifier score in brackets after each occurrence):
the cuticular melanization phenotype of black [-2.780] flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black [+9.759] mutants and although …
Problem of Domain Overfitting (slide dated Nov 27, 2007)
• Ideal setting: gene name recognizer 54.1%
• Realistic setting: gene name recognizer 28.1%
• Fly gene names shown as examples: wingless, daughterless, eyeless, apexless, …
Solution: Learn Generalizable Features
…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…
…that CD38 is expressed by both neurons and glial cells…
…that PABPC5 is expressed in fetal brain and in a range of adult tissues.
Generalizable Feature: “w+2 = expressed” (the word two positions after the gene name is “expressed”)
Generalizability-Based Feature Ranking
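[Figure: features such as “-less” and “expressed” are ranked separately within each training data set (ranks 1 to 8 shown). The per-domain rankings are combined into a single generalizability-based ranking in which “expressed” is placed ahead of “-less”; scores shown in the combined ranking include 0.125 and 0.167.]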
Effectiveness of Domain Adaptation
• Standard learning: Fly + Mouse → Yeast, gene name recognizer 63.3%
• Domain adaptive learning: Fly + Mouse → Yeast, gene name recognizer 75.9%
More Results on Domain Adaptation

| Exp | Method | Precision | Recall | F1 |
|---|---|---|---|---|
| F+M→Y | Baseline | 0.557 | 0.466 | 0.508 |
| | Domain | 0.575 | 0.516 | 0.544 |
| | % Imprv. | +3.2% | +10.7% | +7.1% |
| F+Y→M | Baseline | 0.571 | 0.335 | 0.422 |
| | Domain | 0.582 | 0.381 | 0.461 |
| | % Imprv. | +1.9% | +13.7% | +9.2% |
| M+Y→F | Baseline | 0.583 | 0.097 | 0.166 |
| | Domain | 0.591 | 0.139 | 0.225 |
| | % Imprv. | +1.4% | +43.3% | +35.5% |

• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)
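The F1 and improvement columns follow from the standard definitions, which the short sketch below recomputes for the F+M→Y row. The helper names are illustrative. Because the published precision and recall are rounded to three digits, a recomputed F1 can differ from the table in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pct_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

baseline_f1 = f1(0.557, 0.466)                  # table reports 0.508
f1_gain = pct_improvement(0.508, 0.544)         # table reports +7.1%
recall_gain = pct_improvement(0.466, 0.516)     # table reports +10.7%
```

In every row the recall gain is much larger than the precision gain, which is why F1 improves.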
Extraction Example 2: Genetic Interaction Relation
Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.
[Figure: two phrases in the sentence above are each labeled “Gene”]
Is there a genetic interaction relation here?
Challenges
• No/little training data
• What features to use?
Solution: Pseudo Training Data
Gene: Bcd +
These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.
Pseudo Training Data Works Reasonably Well
[Figure: plot with axes Precision and Recall]
Using all features works the best
Large-Scale Entity/Relation Extraction
• Entity annotation

| Entity Type | Resource | Method |
|---|---|---|
| Gene | NCBI, FlyBase, … | Dictionary string search + machine learning |
| Anatomy | FlyBase | Dictionary string search |
| Chemical | MeSH, Biosis, … | Dictionary string search |
| Behavior | | “x x behavior” pattern search |

• Relation extraction

| Relation Type | Method |
|---|---|
| Regulatory | Pre-defined pattern + machine learning |
| Expressed In | Co-occurrence + relevant keywords |
| Gene Behavior | Co-occurrence |
| Gene Chemical | Co-occurrence |
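Dictionary string search and co-occurrence, the two simplest methods in the tables above, can be sketched as follows. The tiny dictionaries are stand-ins assumed for illustration (the real resources are the databases named in the table), and the function names are hypothetical.

```python
import re
from itertools import product

# Stand-in dictionaries; real entries would come from NCBI, FlyBase, MeSH, ...
DICTIONARIES = {
    "Gene": {"bcd", "hb", "hunchback", "bicoid"},
    "Anatomy": {"embryo", "egg"},
}

def annotate(sentence):
    """Dictionary string search: return (entity_type, surface_form) pairs."""
    words = re.findall(r"[A-Za-z0-9]+", sentence)
    return [(etype, w)
            for w in words
            for etype, names in DICTIONARIES.items()
            if w.lower() in names]

def cooccurrences(sentence, type_a, type_b):
    """Pair every type_a entity with every type_b entity in the same sentence."""
    entities = annotate(sentence)
    a = sorted({w for t, w in entities if t == type_a})
    b = sorted({w for t, w in entities if t == type_b})
    return [(x, y) for x, y in product(a, b) if x != y]

sentence = "Bcd regulates the expression of hunchback (hb) in the anterior half of the egg."
pairs = cooccurrences(sentence, "Gene", "Anatomy")
```

Co-occurrence alone over-generates, which is presumably why the table adds relevant keywords or machine learning for the Expressed In and Regulatory relations.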
Part 2: Semantic Navigation
Space-Region Navigation
[Figure: Literature Spaces (Bee, Fly, Bird, Behavior, …) and Topic Regions (Bee Forager, Fly Rover, Bird Singing, …) are connected by MAP and EXTRACT operations, with a SWITCHING operation also shown. Intersection, Union, … operations apply to both “My Spaces” and “My Regions/Topics”.]
General Approach: Language Models
• Topic = word distribution
• Modeling text in a space with mixture models of multinomial distributions
• Text Mining = Parameter Estimation + Inferences
• Matching = Compute similarity between word distributions
• Users can “control” a model by specifying topic preferences
A Sample Topic & Corresponding Space
Word Distribution (language model)
filaments 0.0410238
muscle 0.0327107
actin 0.0287701
z 0.0221623
filament 0.0169888
myosin 0.0153909
thick 0.00968766
thin 0.00926895
sections 0.00924286
er 0.00890264
band 0.00802833
muscles 0.00789018
antibodies 0.00736094
myofibrils 0.00688588
flight 0.00670859
images 0.00649626
Meaningful labels
• actin filaments
• flight muscle
• flight muscles
Example documents
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
MAP: Topic/Region → Space
• MAP: Use the topic/region description as a query to search a given space
• Retrieval algorithm:
– Query word distribution: $p(w \mid \theta_Q)$
– Document word distribution: $p(w \mid \theta_D)$
– Score a document based on similarity of Q and D:
$$\text{score}(Q, D) = -D(\theta_Q \,\|\, \theta_D) = -\sum_{w \in \text{Vocabulary}} p(w \mid \theta_Q) \log \frac{p(w \mid \theta_Q)}{p(w \mid \theta_D)}$$
• Leverage existing retrieval toolkits: Lemur/Indri
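The scoring rule can be illustrated with a few lines of code that rank documents by negative KL divergence between the query and document word distributions. This is a sketch, not Lemur/Indri code: the additive smoothing in `unigram_lm` and the toy documents are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

def unigram_lm(text, vocab, mu=1.0):
    """Additively smoothed word distribution over vocab (smoothing choice is an assumption)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}

def kl_score(p_q, p_d):
    """score(Q, D) = -KL(theta_Q || theta_D); higher means a closer match."""
    return -sum(pq * math.log(pq / p_d[w]) for w, pq in p_q.items() if pq > 0)

docs = {
    "d1": "actin filaments in flight muscle",
    "d2": "division of labor among foragers",
}
query = "flight muscle filaments"
vocab = sorted({w for t in list(docs.values()) + [query] for w in t.lower().split()})
p_q = unigram_lm(query, vocab)
ranking = sorted(docs, key=lambda d: kl_score(p_q, unigram_lm(docs[d], vocab)),
                 reverse=True)
```

Smoothing keeps every document probability positive so the logarithm is always defined; a document identical in distribution to the query scores zero, the maximum.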
EXTRACT: Space → Topic/Region
• Assume k topics, each being represented by a word distribution
• Use a k-component mixture model to fit the documents in a given space (EM algorithm)
• The estimated k component word distributions are taken as k topic regions
Likelihood:
$$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\left[\lambda\, p(D_i \mid \theta_B) + (1 - \lambda) \sum_{j=1}^{k} \pi_j\, p(D_i \mid \theta_j)\right]$$
Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$
Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda)\, p(\Lambda)$
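A compact EM loop for a mixture of this shape is sketched below. Several choices are assumptions, not details from the slides: the mixing weights are kept per document (`pi[d][j]`), the background distribution is the collection word frequency, each topic is seeded from one document, and a user prior enters as pseudo-counts scaled by `mu` in the M-step, one common way to realize a Bayesian estimate.

```python
from collections import Counter

EPS = 1e-9  # tiny floor that keeps every probability strictly positive

def normalize(weights):
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def em_topics(docs, k, lam=0.5, iters=30, prior=None, mu=0.0):
    """Fit k topic word distributions; returns (topics, pi)."""
    counts = [Counter(d) for d in docs]
    vocab = sorted(set().union(*counts))
    background = normalize({w: sum(c[w] for c in counts) for w in vocab})
    # Seed topic j from one document (assumed initialization), lightly smoothed.
    topics = [normalize({w: counts[j * len(docs) // k][w] + 0.1 for w in vocab})
              for j in range(k)]
    pi = [[1.0 / k] * k for _ in docs]
    for _ in range(iters):
        word_mass = [dict.fromkeys(vocab, EPS) for _ in range(k)]
        for d, c in enumerate(counts):
            topic_mass = [EPS] * k
            for w, n in c.items():
                mix = [pi[d][j] * topics[j][w] for j in range(k)]
                s = sum(mix)
                # E-step: chance the word came from some topic, not the background.
                p_topic = (1 - lam) * s / (lam * background[w] + (1 - lam) * s)
                for j in range(k):
                    r = n * p_topic * mix[j] / s
                    topic_mass[j] += r
                    word_mass[j][w] += r
            total = sum(topic_mass)
            pi[d] = [m / total for m in topic_mass]
        # M-step: renormalize expected counts, adding prior pseudo-counts if given.
        for j in range(k):
            if prior is not None:
                for w, p in prior[j].items():
                    if w in word_mass[j]:
                        word_mass[j][w] += mu * p
            topics[j] = normalize(word_mass[j])
    return topics, pi

docs = [
    "flight muscle actin flight muscle".split(),
    "muscle actin filaments flight".split(),
    "foraging nectar pollen foraging".split(),
    "nectar foraging colony pollen".split(),
]
topics, pi = em_topics(docs, k=2)
```

Passing something like `prior=[{"labor": 0.5, "division": 0.5}, {}]` with a positive `mu` would pull the first topic toward the user's preferred words, in the spirit of the user-controlled exploration described in this part of the talk.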
User-Controlled Exploration: Sample Topic 1
Prior: labor 0.2, division 0.2
age 0.0672687
division 0.0551497
labor 0.052136
colony 0.038305
foraging 0.0357817
foragers 0.0236658
workers 0.0191248
task 0.0190672
behavioral 0.0189017
behavior 0.0168805
older 0.0143466
tasks 0.013823
old 0.011839
individual 0.0114329
ages 0.0102134
young 0.00985875
genotypic 0.00963096
social 0.00883439
User-Controlled Exploration: Sample Topic 2
Prior: behavioral 0.2, maturation 0.2
behavioral 0.110674
age 0.0789419
maturation 0.057956
task 0.0318285
division 0.0312101
labor 0.0293371
workers 0.0222682
colony 0.0199028
social 0.0188699
behavior 0.0171008
performance 0.0117176
foragers 0.0110682
genotypic 0.0106029
differences 0.0103761
polyethism 0.00904816
older 0.00808171
plasticity 0.00804363
changes 0.00794045
Exploit Prior for Concept Switching
First word distribution:
foraging 0.290076
nectar 0.114508
food 0.106655
forage 0.0734919
colony 0.0660329
pollen 0.0427706
flower 0.0400582
sucrose 0.0334728
source 0.0319787
behavior 0.0283774
individual 0.028029
rate 0.0242806
recruitment 0.0200597
time 0.0197362
reward 0.0196271
task 0.0182461
sitter 0.00604067
rover 0.00582791
rovers 0.00306051
Second word distribution:
foraging 0.142473
foragers 0.0582921
forage 0.0557498
food 0.0393453
nectar 0.03217
colony 0.019416
source 0.0153349
hive 0.0151726
dance 0.013336
forager 0.0127668
information 0.0117961
feeder 0.010944
rate 0.0104752
recruitment 0.00870751
individual 0.0086414
reward 0.00810706
flower 0.00800705
dancing 0.00794827
behavior 0.00789228
Part 3: Entity Summarization
Automated Gene Summarization?
Multi-Aspect Gene Summary
• Gene product
• Expression
• Sequence
• Interactions
• Mutations
• General Functions
A Two-Stage Approach
Text Summary of Gene Abl
General Entity Summarizer
• Task: Given any entity and k aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
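The four steps above can be sketched as a small pipeline. Everything concrete here is assumed for illustration: the keyword sets stand in for trained aspect recognizers, the gene names `abc1` and `xyz2` and the sentences are made up, and "best" is taken to mean the highest keyword overlap.

```python
import re

# Keyword sets standing in for trained per-aspect recognizers (hypothetical).
ASPECT_KEYWORDS = {
    "Expression": {"expressed", "expression"},
    "Interactions": {"interacts", "binds", "regulates"},
    "Mutations": {"mutant", "mutants", "mutation"},
}

def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())

def summarize(entity, sentences, per_aspect=1):
    """Return {aspect: [top sentences]} for sentences that mention the entity."""
    candidates = {aspect: [] for aspect in ASPECT_KEYWORDS}
    for s in sentences:
        words = set(tokenize(s))
        if entity.lower() not in words:
            continue  # retrieval step: keep only sentences mentioning the entity
        scores = {a: len(words & kws) for a, kws in ASPECT_KEYWORDS.items()}
        best = max(scores, key=scores.get)  # classification step
        if scores[best] > 0:
            candidates[best].append((scores[best], s))
    # Selection step: keep the highest-scoring sentences per aspect.
    return {a: [s for _, s in sorted(c, key=lambda t: -t[0])[:per_aspect]]
            for a, c in candidates.items()}

sentences = [
    "abc1 is expressed in the embryo.",
    "abc1 mutants show wing defects.",
    "xyz2 binds DNA.",
]
summary = summarize("abc1", sentences)
```

The result is a semi-structured summary: one slot per aspect, each filled with the selected sentences or left empty when nothing relevant was found.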
Further Generalizations
• Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
– Train a recognizer for each aspect
– Given an entity, retrieve sentences relevant to the entity
– Classify each sentence into one of the k aspects
– Choose the best sentences in each category
New method based on mixture model and regularized optimization
Part 4. Function Analysis
Annotating Gene Lists: GO Terms vs. Literature Mining
Limitations of GO annotations:
- Labor-intensive
- Limited Coverage
Literature Mining:
- Automatic
- Flexible exploration in the entire literature space
Overview of Gene List Annotator
• Gene group: Bcd, Cad, …, Tll
• For any gene: retrieve its relevant documents (Entrez Gene, …) to form document sets
• For any term: test its significance
• Enriched concepts: Segmentation 56.0, Pattern 34.2, Cell_cycle 25.6, Development 22.1, Regulation 20.4, …
• Interactive analysis
Intuition for Literature-based Annotation
Gene TPI1 GPM1 PGK1 TDH3 TDH2
protein_kinase 0 0 2 0 0
decarboxylase 10 0 10 7 6
protein 39 26 65 44 33
stationary_phase 2 7 3 4 2
energy_metabolism 4 5 5 8 0
oscillation 0 0 0 0 1
Likelihood Ratio Test with 2-Poisson Mixture Model
Dataset distribution: Poisson(λ;d)
Reference distribution: Poisson(λ0;d)
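As a simplified illustration of the idea, the sketch below runs a likelihood ratio test for a single Poisson rate. It is a deliberately reduced stand-in, not the 2-Poisson mixture model named in the slide title: a term is assumed to occur `x` times across `d` documents of the gene list, and `lam0` is its per-document rate in the reference collection. All numbers are hypothetical.

```python
import math

def poisson_lrt(x, d, lam0):
    """Return 2 * [log L(lam_hat) - log L(lam0)] for count x over d documents."""
    lam_hat = x / d  # maximum likelihood rate in the dataset
    # With log L(lam) = x * log(lam * d) - lam * d - log(x!), the difference is:
    log_term = x * math.log(lam_hat / lam0) if x > 0 else 0.0
    return 2.0 * (log_term - d * (lam_hat - lam0))

same_rate = poisson_lrt(10, 100, 0.1)     # dataset rate equals the reference rate
mild = poisson_lrt(15, 100, 0.1)          # 1.5x the reference rate
strong = poisson_lrt(30, 100, 0.1)        # 3x the reference rate
```

Larger statistics indicate stronger departure from the reference rate, so terms can be ranked by this value. The statistic is two-sided, so in practice one would also require the dataset rate to exceed the reference rate before calling a term enriched.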
Agreement with GO-based Method
• Gene List: 93 genes up-regulated by the manganese treatment

| GO Theme | Related Annotator terms |
|---|---|
| neurogenesis | axon guidance, growth cone, commissural axon, proneural gene |
| synaptic transmission | synaptic vesicle, neurotransmitter release, synaptic transmission, sodium channel |
| cytoskeletal protein | alpha tubulin, actin filament |
| cell communication | tight junction, heparan sulfate proteoglycan |
Discovering Novel Themes
• Gene List: 69 genes up-regulated by the methoprene treatment

| Theme | Annotator terms |
|---|---|
| muscle | flight muscle, muscle myosin, nonmuscle myosin, light chain, myosin ii, thick filament, thin filament, striated muscle |
| synaptic transmission | neurotransmitter release, synaptic transmission, synaptic vesicle |
| signaling pathway | notch signal |
Summary
[Figure: BeeSpace technology overview diagram, with regions marked for the four parts of this talk]
Part 1. Information Extraction
Part 2. Navigation Support
Part 3. Entity Summarization
Part 4. Function Analysis
Machine Learning + Language Models + Minimum Human Effort
General and scalable, but there’s room for deeper semantics
Looking Ahead…
• Knowledge integration, inferences
• Support for hypothesis formulation and testing
Exploring Knowledge Space
[Figure: knowledge graph with gene nodes (A1, A1’, A2, A3, A4, A4’, A5) and behavior nodes (B1, B2, B3, B4), connected by edges labeled isa, Co-occur-fly, Co-occur-mos, Co-occur-bee, Orth-mos, orth, and Reg]
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}
P = PathBetween(Z, B4, {co-occur, reg, isa})
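The NeighborOf steps can be mimicked on a toy graph. The edge list below is hypothetical: the exact edges of the slide's figure are not recoverable, so it was chosen only so that the four steps reproduce the sets shown above. Edges are treated as undirected, and PathBetween is left out of the sketch.

```python
# Hypothetical (source, relation, target) edges, loosely modeled on the slide.
EDGES = [
    ("B4", "isa", "B3"),
    ("B4", "co-occur", "B1"),
    ("B4", "co-occur", "B2"),
    ("B1", "co-occur", "A1"),
    ("B1", "co-occur", "A1'"),
    ("B2", "co-occur", "A2"),
    ("B3", "co-occur", "A3"),
    ("A1", "orth", "A1'"),
    ("A1", "reg", "A4"),
    ("A1'", "reg", "A4'"),
    ("A5", "reg", "A4"),
]

def type_of(node):
    # Naming convention assumed for this toy graph: B* = Behavior, A* = Gene.
    return "Behavior" if node.startswith("B") else "Gene"

def neighbor_of(nodes, wanted_type, relations):
    """Nodes of wanted_type one hop from `nodes` via any of the given relations."""
    nodes = set(nodes)
    found = set()
    for a, rel, b in EDGES:
        if rel not in relations:
            continue
        for src, dst in ((a, b), (b, a)):  # follow each edge in both directions
            if src in nodes and dst not in nodes and type_of(dst) == wanted_type:
                found.add(dst)
    return found

x = neighbor_of({"B4"}, "Behavior", {"co-occur", "isa"})
y = neighbor_of(x, "Gene", {"co-occur", "orth"})
y_extended = y | {"A5", "A6"}          # step 3: the user adds genes by hand
z = neighbor_of(y_extended, "Gene", {"reg"})
```

Each step returns a plain set, so user edits such as step 3 are ordinary set operations between queries.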
Full-Fledged BeeSpace V5
[Figure: system diagram with the following labeled components]
• Biomedical Literature
• Entities: Gene, Behavior, Anatomy, Chemical
• Relations: Orthology, Regulatory interaction, …
• Experiment Data
• Analysis
• Additional entities and relations
• Expert knowledge
• Inferences
• Hypothesis Formulation & Testing
Thanks to
Xin He (UIUC)
Jing Jiang (SMU)
Yanen Li (UIUC)
Xu Ling (UIUC)
Yue Lu (UIUC)
Qiaozhu Mei (UIUC/Michigan)
& Bruce Schatz (PI, BeeSpace)
Thank You!