BeeSpace Informatics Research

ChengXiang (“Cheng”) Zhai

Department of Computer Science

Institute for Genomic Biology

Statistics

Graduate School of Library & Information Science

University of Illinois at Urbana-Champaign

BeeSpace Workshop, May 22, 2009

Goal of Informatics Research

• Develop general and scalable computational methods to enable

– Semantic integration of data and information

– Effective information access and exploration

– Knowledge discovery

– Hypothesis formulation and testing

• Reinforcement of research in biology and computer science

– CS research to automate manual tasks of biologists

– Biology research to raise new challenges for CS

2

Overview of BeeSpace Technology

[Architecture diagram with the following labels.]

Data and components: Literature Text; Natural Language Understanding; Words/Phrases, Entities, Relations; Meta Data; Search Engine; Text Miner; Relational Database; Function Annotator; Gene Summarizer; Space/Region Manager, Navigation Support; Users

Function layers: Content Analysis; Information Access & Exploration; Knowledge Discovery & Hypothesis Testing; Question Answering

3

Informatics Research Accomplishments


Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]

Entity/Relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b]

Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b],

[Chee & Schatz 08]

Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08]

Automatic Function Annotation [He et al. 09/10]

4

Overview of BeeSpace Technology


Part 1. Information Extraction

Part 2. Navigation Support

Part 3. Entity Summarization

Part 4. Function Analysis

5

Part 1. Information Extraction

6

Natural Language Understanding

…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …

[Figure: the sentence is annotated with a parse tree of NP and VP phrase nodes, and two spans are tagged as Gene.]

7

Entity & Relation Extraction

Genetic Interaction

Gene X    Gene Y
Bcd       hb
…         …

Expression Location

Gene X    Anatomy Y
Bcd       embryo
Hb        egg
…         …

8

Lopes FJ et al., 2005 J. Theor. Biol.

General Approach: Machine Learning

• Computers learn from labeled examples to compute a function to predict labels of new examples

• Examples of predictions

– Given a phrase, predict whether it is a gene name

– Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation

• Many learning methods are available, but training data isn’t always available

9

Extraction Example 1: Gene Name Recognition

… expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation.

10

[Figure callouts: three candidate phrases in the passage are marked “Gene?”.]

Features for Recognizing Genes

• Syntactic clues:

– Capitalization (especially acronyms)

– Numbers (gene families)

– Punctuation: -, /, :, etc.

• Contextual clues:

– Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.

– Global: same noun phrase occurs several times in the same article
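The clues listed above can be turned into binary features for a candidate token. The following is a minimal sketch, not the BeeSpace implementation: the function name `gene_features`, the clue-word set, the window size, and the "occurs at least twice" threshold for the global clue are all assumptions made for illustration.

```python
import re
from collections import Counter

# Clue words taken from the slide; a real system would use a larger list.
CONTEXT_WORDS = {"gene", "encoding", "regulation", "expressed"}


def gene_features(tokens, i, window=2):
    """Return binary features for the candidate token at position i."""
    tok = tokens[i]
    feats = {
        # Syntactic clues.
        "starts_with_capital": tok[:1].isupper(),
        "all_caps_acronym": len(tok) > 1 and tok.isupper(),
        "contains_digit": any(c.isdigit() for c in tok),
        "contains_punct": bool(re.search(r"[-/:]", tok)),
    }
    # Local context: a clue word within `window` tokens of the candidate.
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = [t.lower() for j, t in enumerate(tokens[lo:hi], start=lo) if j != i]
    feats["near_clue_word"] = any(t in CONTEXT_WORDS for t in neighbors)
    # Global context: the same token occurs more than once in the article
    # (the threshold of 2 is an assumption).
    feats["repeated_in_article"] = Counter(tokens)[tok] >= 2
    return feats
```

Each feature is deliberately cheap to compute; the learning method then decides how much weight each clue deserves.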

11

Maximum Entropy Model for Gene Tagging

• Given an observation (a token or a noun phrase), together with its context, denoted as x

• Predict y ∈ {gene, non-gene}

• Maximum entropy model:

$P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$

• Typical f:

– y = gene & candidate phrase starts with a capital letter

– y = gene & candidate phrase contains digits

• Estimate $\lambda_i$ with training data
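A maximum entropy tagger of this form can be sketched in a few lines. The code below is an illustration under stated assumptions: each feature function pairs one observed clue with a label, the weights are fit by plain gradient ascent with L2 regularization, and the names `active_features`, `predict_proba`, and `train` as well as the learning-rate settings are hypothetical, not taken from the talk.

```python
import math

LABELS = ("gene", "non-gene")


def active_features(x, y):
    """Indicator features f_i(x, y): each observed clue paired with label y."""
    return [(name, y) for name, on in x.items() if on] + [("bias", y)]


def predict_proba(weights, x):
    """P(y | x) = K * exp(sum_i lambda_i * f_i(x, y)), K normalizing over labels."""
    scores = {y: sum(weights.get(f, 0.0) for f in active_features(x, y)) for y in LABELS}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train(examples, epochs=200, lr=0.5, l2=0.01):
    """Estimate the weights by gradient ascent on the regularized log-likelihood."""
    weights = {}
    for _ in range(epochs):
        grad = {}
        for x, gold in examples:
            probs = predict_proba(weights, x)
            for y in LABELS:
                # Observed minus expected count of each feature.
                delta = (1.0 if y == gold else 0.0) - probs[y]
                for f in active_features(x, y):
                    grad[f] = grad.get(f, 0.0) + delta
        for f, g in grad.items():
            w = weights.get(f, 0.0)
            weights[f] = w + lr * (g / len(examples) - l2 * w)
    return weights
```

Here `x` is a dictionary of binary clues (for example `{"capital": True, "digit": False}`), so the output of any feature extractor with that shape can be fed to `train` directly.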

12

Special Challenges

• Gene name disambiguation

• Domain adaptation

13

Gene Name Disambiguation

• Gene names can be common English words:

for (foraging), in (inturned), similar (sima), yellow (y), black (b)…

• Solution:

– Disambiguate by looking at the context of the candidate word

– Train a classifier
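One simple way to realize "look at the context, train a classifier" is to weight each neighboring word by how much more often it appears around gene uses than around ordinary uses. The sketch below uses smoothed log-odds weights (a naive-Bayes-style stand-in, assumed here for illustration; the talk does not specify the classifier), and the function names and window size are hypothetical.

```python
import math
from collections import Counter


def context(tokens, i, window=3):
    """Lower-cased words within `window` positions of token i, excluding token i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [t.lower() for j, t in enumerate(tokens[lo:hi], start=lo) if j != i]


def train_log_odds(examples, window=3):
    """examples: (tokens, index, is_gene). Returns word -> smoothed log-odds weight."""
    pos, neg = Counter(), Counter()
    for tokens, i, is_gene in examples:
        (pos if is_gene else neg).update(context(tokens, i, window))
    vocab = set(pos) | set(neg)
    n_pos, n_neg, v = sum(pos.values()), sum(neg.values()), len(vocab)
    # Add-one smoothing keeps the weights finite for words seen in one class only.
    return {
        w: math.log((pos[w] + 1) / (n_pos + v)) - math.log((neg[w] + 1) / (n_neg + v))
        for w in vocab
    }


def score(weights, tokens, i, window=3):
    """Sum of neighbor-word weights; a positive score favors the gene reading."""
    return sum(weights.get(w, 0.0) for w in context(tokens, i, window))
```

Words with large positive or negative weights play the role of discriminative neighbor words: "gene" or "mutants" next to the candidate pushes the score up, while "food" pushes it down.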

14

Discriminative Neighbor Words

15

Sample Disambiguation Results

16

Candidate words: “foraging”, “for” (classifier score in brackets after each candidate occurrence)

... affect complex behaviors such as locomotion and foraging [-1.468]. The foraging [+3.359] (for) [+5.497] gene encodes a pkg in drosophila melanogaster here we demonstrate a function for [-0.582] the for [+5.980] gene in sensory responsiveness and …

Candidate word: “black”

the cuticular melanization phenotype of black [-2.780] flies is rescued by beta-alanine but beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black [+9.759] mutants and although …

Nov 27, 2007 17

Problem of Domain Overfitting

Gene name recognizer, ideal setting: 54.1%

Gene name recognizer, realistic setting: 28.1%

fly: wingless, daughterless, eyeless, apexless

Solution: Learn Generalizable Features

…decapentaplegic and wingless are expressed in analogous patterns in each primordium of…

…that CD38 is expressed by both neurons and glial cells…

…that PABPC5 is expressed in fetal brain and in a range of adult tissues.

Generalizable Feature: “w+2 = expressed”

Generalizability-Based Feature Ranking

[Figure: the training data yields four ranked feature lists (positions 1 to 8). The features “-less” and “expressed” appear at different positions in each list, and the final ranking shows “expressed” and “-less” with the values 0.125 and 0.167.]

20

Effectiveness of Domain Adaptation

Fly + Mouse → Yeast, gene name recognizer, standard learning: 63.3%

Fly + Mouse → Yeast, gene name recognizer, domain adaptive learning: 75.9%

More Results on Domain Adaptation

Exp      Method     Precision   Recall    F1
F+M→Y    Baseline   0.557       0.466     0.508
         Domain     0.575       0.516     0.544
         % Imprv.   +3.2%       +10.7%    +7.1%
F+Y→M    Baseline   0.571       0.335     0.422
         Domain     0.582       0.381     0.461
         % Imprv.   +1.9%       +13.7%    +9.2%
M+Y→F    Baseline   0.583       0.097     0.166
         Domain     0.591       0.139     0.225
         % Imprv.   +1.4%       +43.3%    +35.5%

• Text data from BioCreAtIvE (Medline)

• 3 organisms (Fly, Mouse, Yeast)
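The F1 and "% Imprv." columns follow from the standard definitions: F1 is the harmonic mean of precision and recall, and the improvement is the relative change over the baseline. A small sketch for checking the rows (the function names are chosen here for illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline, new):
    """Relative change over the baseline, in percent."""
    return (new - baseline) / baseline * 100.0
```

Recomputing F1 from the rounded precision and recall values reproduces the F1 column to within about 0.001, and the relative change between the reported F1 values matches the improvement row.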

Extraction Example 2: Genetic Interaction Relation

Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg.

[Figure callouts: two mentions in the sentence are tagged as Gene.]

Is there a genetic interaction relation here?

Challenges

• No/little training data

• What features to use?

23

Solution: Pseudo Training Data

24

Gene:

Bcd +

These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.

Pseudo Training Data Works Reasonably Well

25

[Figure: plot with Precision and Recall axes.]

Using all features works the best

Large-Scale Entity/Relation Extraction

• Entity annotation

Entity Type   Resource            Method
Gene          NCBI, FlyBase, …    Dictionary string search + machine learning
Anatomy       FlyBase             Dictionary string search
Chemical      MeSH, Biosis, …     Dictionary string search
Behavior                          “x x behavior” pattern search

• Relation extraction

Relation Type    Method
Regulatory       Pre-defined pattern + machine learning
Expressed In     Co-occurrence + relevant keywords
Gene Behavior    Co-occurrence
Gene Chemical    Co-occurrence
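Dictionary string search and co-occurrence extraction are simple enough to sketch directly. The code below is an illustration only: sentence-level co-occurrence, case-insensitive whole-word matching, and the function names are assumptions, and the tiny dictionary in the usage note is made up rather than drawn from NCBI or FlyBase.

```python
import re
from itertools import product


def find_entities(sentence, dictionary):
    """Dictionary string search: return {entity_type: set of matched names}."""
    found = {}
    for etype, names in dictionary.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                found.setdefault(etype, set()).add(name)
    return found


def cooccurrence_relations(sentences, dictionary, type_a, type_b, keywords=None):
    """Pair every type_a entity with every type_b entity in the same sentence.

    If `keywords` is given, the sentence must also contain one of them
    (the "co-occurrence + relevant keywords" variant).
    """
    relations = set()
    for s in sentences:
        if keywords and not any(k in s.lower() for k in keywords):
            continue
        found = find_entities(s, dictionary)
        for a, b in product(found.get(type_a, ()), found.get(type_b, ())):
            relations.add((a, b))
    return relations
```

For example, with `dictionary = {"Gene": {"Bcd", "hb"}, "Anatomy": {"egg", "embryo"}}` and the keyword set `{"expressed"}`, the sentence "Bcd is expressed in the anterior half of the egg." yields the pair `("Bcd", "egg")`. Case-insensitive matching is convenient but would over-match gene names that are common words, which is why the Gene row combines dictionary search with machine learning.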

Part 2: Semantic Navigation

27

Space-Region Navigation

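[Figure: Literature Spaces (Bee, Fly, Behavior, Bird, …) and Topic Regions (Bee Forager, Fly Rover, Bird Singing, …) are connected by the operations MAP, EXTRACT, and SWITCHING. Set operations (Intersection, Union, …) are shown for both spaces and regions, next to the user's saved “My Spaces” and “My Regions/Topics”.]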

28

General Approach: Language Models

• Topic = word distribution

• Modeling text in a space with mixture models of multinomial distributions

• Text Mining = Parameter Estimation + Inferences

• Matching = Compute similarity between word distributions

• Users can “control” a model by specifying topic preferences

29

A Sample Topic & Corresponding Space

Word Distribution (language model)

filaments 0.0410238, muscle 0.0327107, actin 0.0287701, z 0.0221623, filament 0.0169888, myosin 0.0153909, thick 0.00968766, thin 0.00926895, sections 0.00924286, er 0.00890264, band 0.00802833, muscles 0.00789018, antibodies 0.00736094, myofibrils 0.00688588, flight 0.00670859, images 0.00649626

Meaningful labels

actin filaments, flight muscle, flight muscles

Example documents

• actin filaments in honeybee-flight muscle move collectively

• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections

• identification of a connecting filament protein in insect fibrillar flight muscle

• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles

• structure of thick filaments from insect flight muscle

30

MAP: Topic/Region → Space

• MAP: Use the topic/region description as a query to search a given space

• Retrieval algorithm:

– Query word distribution: p(w|Q)

– Document word distribution: p(w|D)

– Score a document based on similarity of Q and D

• Leverage existing retrieval toolkits: Lemur/Indri

$$\mathrm{score}(Q, D) = -D(Q \,\|\, D) = -\sum_{w \in \mathrm{Vocabulary}} p(w \mid Q) \log \frac{p(w \mid Q)}{p(w \mid D)}$$
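The retrieval step can be sketched as negative KL divergence between a query word distribution and a smoothed document word distribution. This is a minimal illustration, not the Lemur/Indri implementation: the Jelinek-Mercer style interpolation with a collection model, the weight `lam`, and the function names are assumptions added so that every query word has nonzero probability under each document.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum likelihood word distribution of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_doc_model(doc_tokens, collection_model, lam=0.5):
    """p(w|D) interpolated with the collection model to avoid zero probabilities."""
    doc = unigram(doc_tokens)
    return {w: (1 - lam) * doc.get(w, 0.0) + lam * p for w, p in collection_model.items()}


def kl_score(query_model, doc_model):
    """score(Q, D) = -sum_w p(w|Q) log(p(w|Q) / p(w|D)); higher means more similar.

    Every query word must be in the collection vocabulary.
    """
    return -sum(
        pq * math.log(pq / doc_model[w]) for w, pq in query_model.items() if pq > 0
    )
```

A topic's word distribution can be passed in as `query_model` directly, which is what makes the same scoring function usable for mapping a topic region into a space.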

31

EXTRACT: Space → Topic/Region

• Assume k topics, each being represented by a word distribution

• Use a k-component mixture model to fit the documents in a given space (EM algorithm)

• The estimated k component word distributions are taken as k topic regions

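Likelihood:

$$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\Big[\lambda\, p(D_i \mid \theta_B) + (1 - \lambda) \sum_{j=1}^{k} \pi_j\, p(D_i \mid \theta_j)\Big]$$

Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$

Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda)\, p(\Lambda)$

Here $C$ is the collection of documents in the space, $D_i$ is the $i$-th word of document $D$, $\theta_B$ is a background word distribution, $\theta_j$ is the $j$-th topic word distribution, and $\Lambda$ is the set of all model parameters.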
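The EM fitting step can be illustrated with a small topic mixture. The sketch below makes several simplifying assumptions that are not from the talk: it uses document-specific mixing weights (as in PLSA), omits the background component, and approximates the Bayesian estimator by adding `mu` pseudo-counts from a user-supplied prior to each topic in the M-step. The function name `plsa_em` and its parameters are hypothetical.

```python
import random
from collections import Counter


def plsa_em(docs, k, iters=50, seed=0, prior=None, mu=0.0):
    """Fit k topic word distributions to tokenized documents with EM.

    prior: optional list of k dicts (word -> probability); each is added to
    its topic as pseudo-counts of total strength mu.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    # Random positive initialization of p(w | theta_j).
    theta = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        z = sum(raw.values())
        theta.append({w: v / z for w, v in raw.items()})
    pi = [[1.0 / k] * k for _ in docs]  # per-document mixing weights

    for _ in range(iters):
        word_acc = [dict.fromkeys(vocab, 0.0) for _ in range(k)]
        for d, cnt in enumerate(counts):
            topic_acc = [0.0] * k
            for w, c in cnt.items():
                # E-step: posterior over topics for word w in document d.
                post = [pi[d][j] * theta[j][w] for j in range(k)]
                z = sum(post)
                for j in range(k):
                    r = c * post[j] / z
                    topic_acc[j] += r
                    word_acc[j][w] += r
            total = sum(topic_acc)
            pi[d] = [t / total for t in topic_acc]
        # M-step: renormalize expected counts, plus prior pseudo-counts.
        for j in range(k):
            if prior is not None:
                for w, p in prior[j].items():
                    if w in word_acc[j]:
                        word_acc[j][w] += mu * p
            z = sum(word_acc[j].values())
            theta[j] = {w: v / z for w, v in word_acc[j].items()}
    return theta, pi
```

Calling `plsa_em(docs, k=2, prior=[{"foraging": 1.0}, {}], mu=100.0)` pulls the first topic toward "foraging", which is one way a user-specified topic preference can steer the estimated regions.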

User-Controlled Exploration: Sample Topic 1

Prior: labor 0.2, division 0.2

age 0.0672687, division 0.0551497, labor 0.052136, colony 0.038305, foraging 0.0357817, foragers 0.0236658, workers 0.0191248, task 0.0190672, behavioral 0.0189017, behavior 0.0168805, older 0.0143466, tasks 0.013823, old 0.011839, individual 0.0114329, ages 0.0102134, young 0.00985875, genotypic 0.00963096, social 0.00883439

Prior: behavioral 0.2, maturation 0.2

behavioral 0.110674, age 0.0789419, maturation 0.057956, task 0.0318285, division 0.0312101, labor 0.0293371, workers 0.0222682, colony 0.0199028, social 0.0188699, behavior 0.0171008, performance 0.0117176, foragers 0.0110682, genotypic 0.0106029, differences 0.0103761, polyethism 0.00904816, older 0.00808171, plasticity 0.00804363, changes 0.00794045

34

User-Controlled Exploration: Sample Topic 2

foraging 0.290076, nectar 0.114508, food 0.106655, forage 0.0734919, colony 0.0660329, pollen 0.0427706, flower 0.0400582, sucrose 0.0334728, source 0.0319787, behavior 0.0283774, individual 0.028029, rate 0.0242806, recruitment 0.0200597, time 0.0197362, reward 0.0196271, task 0.0182461, sitter 0.00604067, rover 0.00582791, rovers 0.00306051

foraging 0.142473, foragers 0.0582921, forage 0.0557498, food 0.0393453, nectar 0.03217, colony 0.019416, source 0.0153349, hive 0.0151726, dance 0.013336, forager 0.0127668, information 0.0117961, feeder 0.010944, rate 0.0104752, recruitment 0.00870751, individual 0.0086414, reward 0.00810706, flower 0.00800705, dancing 0.00794827, behavior 0.00789228

Exploit Prior for Concept Switching

35

Part 3: Entity Summarization

36

Multi-Aspect Gene Summary

Aspects: Gene product, Expression, Sequence, Interactions, Mutations, General Functions

Automated Gene Summarization?

A Two-Stage Approach

Text Summary of Gene Abl

General Entity Summarizer

• Task: Given any entity and k aspects to summarize, generate a semi-structured summary

• Assumption: Training sentences available for each aspect

• Method:

– Train a recognizer for each aspect

– Given an entity, retrieve sentences relevant to the entity

– Classify each sentence into one of the k aspects

– Choose the best sentences in each category
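The four method steps can be sketched as a small pipeline. This is an illustration under loud assumptions: the trained per-aspect recognizers are replaced by hand-written keyword sets, "best" is taken to mean "most keyword matches", and entity retrieval is a plain substring match. The function name and the toy sentences in the usage note are made up for the example.

```python
def summarize_entity(entity, sentences, aspect_keywords, per_aspect=1):
    """Return {aspect: top sentences about `entity`}.

    aspect_keywords maps each aspect to a set of lower-case cue words; it
    stands in for the trained recognizers of step 1.
    """
    # Step 2: retrieve sentences relevant to the entity.
    relevant = [s for s in sentences if entity.lower() in s.lower()]
    buckets = {aspect: [] for aspect in aspect_keywords}
    for s in relevant:
        words = set(s.lower().replace(".", " ").replace(",", " ").split())
        # Step 3: classify the sentence into the best-matching aspect.
        scores = {a: len(words & kws) for a, kws in aspect_keywords.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            buckets[best].append((scores[best], s))
    # Step 4: keep the highest-scoring sentences in each aspect.
    return {
        a: [s for _, s in sorted(items, key=lambda t: -t[0])[:per_aspect]]
        for a, items in buckets.items()
    }
```

With aspects `{"Expression": {"expressed", "expression"}, "Interactions": {"interacts", "binds"}}` and toy sentences such as "Abl is expressed in the embryonic nervous system." and "Abl interacts with another protein.", each sentence lands under its own aspect, giving the semi-structured layout the task describes.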

40

Further Generalizations

• Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary

• Assumption: Training sentences available for each aspect

• Method:

– Train a recognizer for each aspect

– Given an entity, retrieve sentences relevant to the entity

– Classify each sentence into one of the k aspects

– Choose the best sentences in each category

41

New method based on mixture model and regularized optimization

Part 4. Function Analysis

42

Annotating Gene Lists: GO Terms vs. Literature Mining

Limitations of GO annotations:
- Labor-intensive
- Limited coverage

Literature mining:
- Automatic
- Flexible exploration in the entire literature space

Overview of Gene List Annotator

[Figure: annotation pipeline, with Entrez Gene shown as a resource.]

1. Gene group: Bcd, Cad, …, Tll
2. For any gene: retrieve its relevant documents, giving one document set per gene (Bcd, Cad, Tll)
3. For any term: test its significance
4. Enriched concepts: Segmentation 56.0, Pattern 34.2, Cell_cycle 25.6, Development 22.1, Regulation 20.4, …
5. Interactive analysis

Intuition for Literature-based Annotation

Gene                TPI1   GPM1   PGK1   TDH3   TDH2
protein_kinase         0      0      2      0      0
decarboxylase         10      0     10      7      6
protein               39     26     65     44     33
stationary_phase       2      7      3      4      2
energy_metabolism      4      5      5      8      0
oscillation            0      0      0      0      1

Likelihood Ratio Test with 2-Poisson Mixture Model

Dataset distribution: Poisson($\lambda$; d)

Reference distribution: Poisson($\lambda_0$; d)
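The idea of testing a term's significance can be illustrated with a likelihood ratio test. The sketch below is a simplified single-Poisson version, not the 2-Poisson mixture named on the slide. It assumes that a term's count in a document set of length d follows Poisson(λ·d), fits λ by maximum likelihood on the gene list's documents, and compares it with a fixed reference rate λ0. The function name and the inputs in the usage note are hypothetical.

```python
import math


def poisson_lrt(counts, lengths, lam0):
    """Likelihood ratio statistic 2 * (log L(lam_hat) - log L(lam0)).

    counts[i]  : occurrences of the term in the i-th gene's documents
    lengths[i] : total length d of those documents
    lam0       : reference rate of the term per unit length
    """
    total_x = sum(counts)
    total_d = sum(lengths)
    lam_hat = total_x / total_d  # maximum likelihood estimate of the rate
    if total_x == 0:
        # The x * log(.) term vanishes when the term never occurs.
        return 2.0 * lam0 * total_d
    return 2.0 * (total_x * math.log(lam_hat / lam0) - (lam_hat - lam0) * total_d)
```

For instance, taking the decarboxylase counts `[10, 0, 10, 7, 6]` with made-up lengths of 100 each and a made-up reference rate of 0.01 gives a statistic far above 3.84, the usual 5% cutoff for one degree of freedom. Note that the statistic is two-sided, so a term that is much rarer than the reference also scores high; a practical annotator would additionally require `lam_hat > lam0`.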

Agreement with GO-based Method

• Gene List: 93 genes up-regulated by the manganese treatment

GO Theme                Related Annotator terms
neurogenesis            axon guidance, growth cone, commissural axon, proneural gene
synaptic transmission   synaptic vesicle, neurotransmitter release, synaptic transmission, sodium channel
cytoskeletal protein    alpha tubulin, actin filament
cell communication      tight junction, heparan sulfate proteoglycan

Discovering Novel Themes

• Gene List: 69 genes up-regulated by the methoprene treatment

Theme                   Annotator terms
muscle                  flight muscle, muscle myosin, nonmuscle myosin, light chain, myosin ii, thick filament, thin filament, striated muscle
synaptic transmission   neurotransmitter release, synaptic transmission, synaptic vesicle
signaling pathway       notch signal

48

Summary


Part 1. Information Extraction

Part 2. Navigation Support

Part 3. Entity Summarization

Part 4. Function Analysis

49

Machine Learning + Language Models + Minimum Human Effort

General and scalable, but there’s room for deeper semantics

Looking Ahead…

• Knowledge integration, inferences

• Support for hypothesis formulation and testing

50

51

Exploring Knowledge Space

[Figure: a knowledge graph linking Gene nodes (A1, A1’, A2, A3, A4, A4’, A5) and Behavior nodes (B1, B2, B3, B4) with edges labeled isa, Co-occur-fly, Co-occur-mos, Co-occur-bee, Orth-mos, orth, and Reg.]

1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1’, A2, A3}
3. Y = Y + {A5, A6} → {A1, A1’, A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4’}

P = PathBetween(Z, B4, {co-occur, reg, isa})
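The NeighborOf steps can be mimicked on a small typed graph. The sketch below is illustrative only: the class `KnowledgeGraph`, its methods, and every edge in the example are hypothetical, with the edges chosen solely so that the four steps return the sets shown on the slide. PathBetween is not implemented here.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Undirected graph with typed nodes and labeled edges."""

    def __init__(self):
        self.node_type = {}
        self.adj = defaultdict(set)  # node -> {(relation, neighbor)}

    def add_edge(self, a, type_a, relation, b, type_b):
        self.node_type[a], self.node_type[b] = type_a, type_b
        self.adj[a].add((relation, b))
        self.adj[b].add((relation, a))

    def neighbor_of(self, sources, node_type, relations):
        """Nodes of `node_type` one edge away from `sources` via `relations`."""
        sources = {sources} if isinstance(sources, str) else set(sources)
        return {
            n
            for s in sources
            for rel, n in self.adj.get(s, ())
            if rel in relations and self.node_type.get(n) == node_type and n not in sources
        }


# Hypothetical edges; they are not recovered from the original figure.
g = KnowledgeGraph()
g.add_edge("B4", "Behavior", "isa", "B1", "Behavior")
g.add_edge("B4", "Behavior", "isa", "B2", "Behavior")
g.add_edge("B4", "Behavior", "co-occur", "B3", "Behavior")
g.add_edge("B1", "Behavior", "co-occur", "A1", "Gene")
g.add_edge("B1", "Behavior", "co-occur", "A1'", "Gene")
g.add_edge("B2", "Behavior", "co-occur", "A2", "Gene")
g.add_edge("B3", "Behavior", "co-occur", "A3", "Gene")
g.add_edge("A3", "Gene", "reg", "A4", "Gene")
g.add_edge("A5", "Gene", "reg", "A4'", "Gene")

x = g.neighbor_of("B4", "Behavior", {"co-occur", "isa"})
y = g.neighbor_of(x, "Gene", {"co-occur", "orth"})
y |= {"A5", "A6"}  # step 3: the user adds genes by hand
z = g.neighbor_of(y, "Gene", {"reg"})
```

Keeping each step as a plain set makes the user-driven part of the exploration (step 3) a simple set union, which matches the interactive flavor of the example.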

52

Full-Fledged BeeSpace V5

Biomedical Literature

Entities
- Gene
- Behavior
- Anatomy
- Chemical

Relations
- Orthology
- Regulatory interaction
- …

Experiment Data

Analysis

Additional entities and relations

Expert knowledge

Inferences

Hypothesis Formulation & Testing

Thanks to

Xin He (UIUC)

Jing Jiang (SMU)

Yanen Li (UIUC)

Xu Ling (UIUC)

Yue Lu (UIUC)

Qiaozhu Mei (UIUC/Michigan)

& Bruce Schatz (PI, BeeSpace)

53

Thank You!

54