domain adaptation for biomedical information extraction

Domain Adaptation for Biomedical Information Extraction Jing Jiang BeeSpace Seminar Oct 17, 2007

Page 1: Domain Adaptation for Biomedical Information Extraction

Domain Adaptation for Biomedical Information Extraction

Jing Jiang

BeeSpace Seminar, Oct 17, 2007

Page 2: Domain Adaptation for Biomedical Information Extraction

Outline

Why do we need domain adaptation?
Solutions:
  Intelligent learning methods
  Knowledge bases
  Expert supervision
Connections with BeeSpace V4

Page 3: Domain Adaptation for Biomedical Information Extraction

Why do we need domain adaptation?

Many biomedical information extraction problems are solved by supervised machine learning methods such as support vector machines (SVMs).
  Entity recognition
  Relation extraction
  Sentence categorization
In supervised machine learning, it is assumed that the training data and the test data have the same distribution.

Page 4: Domain Adaptation for Biomedical Information Extraction

Why do we need domain adaptation?

Existing labeled training data is often limited to certain domains.
  GENIA corpus: human, blood cells, transcription factors
  PennBioIE: genetic variation in malignancy, cytochrome P450 inhibition
  Training data for sentence categorization in gene summarizer: fly
Even when the training data is diverse (containing multiple domains), it would still be nice to customize the classifier for the particular target domain that we are working on.

Page 5: Domain Adaptation for Biomedical Information Extraction

Why do we need domain adaptation?

NER task                                         Train → Test     F1
To find PER, LOC, ORG from news text             NYT → NYT        0.855
                                                 Reuters → NYT    0.641
To find gene/protein from biomedical literature  mouse → mouse    0.541
                                                 fly → mouse      0.281
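To make the size of the drop concrete, the short sketch below computes the fraction of in-domain F1 that is lost when the training domain changes. The F1 values are copied from the table; the helper name `relative_drop` is only illustrative.

```python
# F1 scores copied from the table above: (train domain, test domain) -> F1.
f1 = {
    ("NYT", "NYT"): 0.855,
    ("Reuters", "NYT"): 0.641,
    ("mouse", "mouse"): 0.541,
    ("fly", "mouse"): 0.281,
}

def relative_drop(in_domain, cross_domain):
    """Fraction of the in-domain F1 that is lost when training on another domain."""
    return (in_domain - cross_domain) / in_domain

news_drop = relative_drop(f1[("NYT", "NYT")], f1[("Reuters", "NYT")])
bio_drop = relative_drop(f1[("mouse", "mouse")], f1[("fly", "mouse")])

print(f"news text: {news_drop:.1%} of in-domain F1 lost")
print(f"biomedical text: {bio_drop:.1%} of in-domain F1 lost")
```

On these numbers the biomedical task loses roughly twice as large a share of its F1 as the news task does.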

Page 6: Domain Adaptation for Biomedical Information Extraction

Solutions to domain adaptation

Intelligent learning methods
  Instance weighting
  Feature selection
Knowledge bases
Expert supervision

thesis research

future work

discussion

Page 7: Domain Adaptation for Biomedical Information Extraction

Domain adaptive learning methods

Two-stage approach
Two frameworks
  Instance weighting
  Feature selection
Use of unlabeled data

Page 8: Domain Adaptation for Biomedical Information Extraction

Intuition

[Figure: source domain and target domain]

Page 9: Domain Adaptation for Biomedical Information Extraction

Goal

[Figure: target domain and source domain]

Page 10: Domain Adaptation for Biomedical Information Extraction

Start from the source domain

[Figure: source domain and target domain]

Page 11: Domain Adaptation for Biomedical Information Extraction

Focus on the common part

[Figure: source domain and target domain]

Page 12: Domain Adaptation for Biomedical Information Extraction

Pick up some part from the target domain

[Figure: source domain and target domain]

Page 13: Domain Adaptation for Biomedical Information Extraction

Formal formulation?

[Figure: source domain and target domain]

How to formally formulate these ideas?

Page 14: Domain Adaptation for Biomedical Information Extraction

Instance weighting

[Figure: source domain and target domain in instance space (each point represents an example)]

Assign different weights to different instances in the objective function.
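A minimal sketch of what weighting instances in the objective function can look like, assuming a binary logistic regression model and a log-likelihood objective. The model choice, the toy data, and the weights are illustrative assumptions, not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_log_likelihood(theta, instances, weights):
    """Sum of weight_i * log p(y_i | x_i; theta) for a binary logistic model.

    instances: list of (x, y), x a list of floats and y in {0, 1}.
    weights:   one non-negative weight per instance.
    """
    total = 0.0
    for (x, y), w in zip(instances, weights):
        p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        p = p1 if y == 1 else 1.0 - p1
        total += w * math.log(p)
    return total

# Toy data (hypothetical): suppose the third source instance looks
# unlike anything in the target domain, so we give it weight 0.
data = [([1.0, 0.5], 1), ([1.0, -0.3], 0), ([1.0, 2.0], 0)]
theta = [0.0, 1.0]

uniform = weighted_log_likelihood(theta, data, [1.0, 1.0, 1.0])
downweighted = weighted_log_likelihood(theta, data, [1.0, 1.0, 0.0])
```

With weight 0 the third instance no longer contributes, so `downweighted` equals the objective computed on the first two instances alone.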

Page 15: Domain Adaptation for Biomedical Information Extraction

Instance weighting

Observation
[Figure: source domain vs. target domain]

Page 16: Domain Adaptation for Biomedical Information Extraction

Instance weighting

Observation
[Figure: source domain vs. target domain]

Page 17: Domain Adaptation for Biomedical Information Extraction

Instance weighting

Analysis of domain difference
p(x, y) = p(x) p(y | x)
  p_s(y | x) ≠ p_t(y | x): labeling difference → labeling adaptation
  p_s(x) ≠ p_t(x): instance difference → instance adaptation?
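The factorization suggests why instance weights can correct for both kinds of difference. The following is a standard importance-weighting identity, written with sums for discrete x (replace the sum over x with an integral for continuous x) and assuming that p_s(x, y) > 0 wherever p_t(x, y) > 0. It is offered as a sketch of the reasoning, not as the exact derivation used in the talk.

```latex
\begin{align*}
\mathbb{E}_{(x,y)\sim p_t}\big[\log p(y \mid x;\theta)\big]
  &= \sum_{x}\sum_{y} p_t(x)\,p_t(y \mid x)\,\log p(y \mid x;\theta) \\
  &= \sum_{x}\sum_{y} p_s(x)\,p_s(y \mid x)\,
     \frac{p_t(x)}{p_s(x)}\,\frac{p_t(y \mid x)}{p_s(y \mid x)}\,
     \log p(y \mid x;\theta) \\
  &= \mathbb{E}_{(x,y)\sim p_s}\!\left[
     \frac{p_t(x)}{p_s(x)}\,\frac{p_t(y \mid x)}{p_s(y \mid x)}\,
     \log p(y \mid x;\theta)\right]
\end{align*}
```

The ratio p_t(x) / p_s(x) addresses the instance difference and the ratio p_t(y | x) / p_s(y | x) addresses the labeling difference. Neither ratio is known in practice, so both have to be estimated or approximated.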

Page 18: Domain Adaptation for Biomedical Information Extraction

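Instance weighting

Three sets of instances: D_s, D_{t,l}, D_{t,u}

\[
\theta_t^* = \arg\max_{\theta} \int_{\mathcal{X}} p_t(x) \sum_{y \in \mathcal{Y}} p_t(y \mid x) \log p(y \mid x; \theta) \, dx
\]

How can the integral over X be approximated with D_s + D_{t,l} + D_{t,u}?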

Page 19: Domain Adaptation for Biomedical Information Extraction

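Instance weighting

Framework: a flexible setup covering both standard methods and new domain adaptive methods

\[
\hat{\theta} = \arg\max_{\theta} \Big[
  \lambda_s \cdot \frac{1}{C_s} \sum_{i=1}^{N_s} \alpha_i \beta_i \log p(y_i^s \mid x_i^s; \theta)
  + \lambda_{t,l} \cdot \frac{1}{C_{t,l}} \sum_{j=1}^{N_{t,l}} \log p(y_j^{t,l} \mid x_j^{t,l}; \theta)
  + \lambda_{t,u} \cdot \frac{1}{C_{t,u}} \sum_{k=1}^{N_{t,u}} \sum_{y \in \mathcal{Y}} \gamma_k(y) \log p(y \mid x_k^{t,u}; \theta)
  + \log p(\theta)
\Big]
\]

\[
\lambda_s + \lambda_{t,l} + \lambda_{t,u} = 1
\]

First term: labeled source data
Second term: labeled target data
Third term: unlabeled target data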
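A rough sketch of how an objective of this shape could be evaluated, assuming a binary logistic model. The names are assumptions for illustration: `alpha` and `beta` are per-instance source weights, `gamma` maps each candidate label of an unlabeled target instance to a weight, each normalizer is taken to be the total weight of its term, and the optional Gaussian prior stands in for log p(theta).

```python
import math

def log_p(theta, x, y):
    """log p(y | x; theta) for a binary logistic model, y in {0, 1}."""
    z = sum(t * xi for t, xi in zip(theta, x))
    p1 = 1.0 / (1.0 + math.exp(-z))
    return math.log(p1 if y == 1 else 1.0 - p1)

def objective(theta, source, target_l, target_u, lambdas, prior_var=None):
    """source:   list of (x, y, alpha, beta)   labeled source data
    target_l: list of (x, y)                labeled target data
    target_u: list of (x, gamma)            unlabeled target data
    lambdas:  (lambda_s, lambda_tl, lambda_tu), expected to sum to 1
    """
    lam_s, lam_tl, lam_tu = lambdas
    total = 0.0

    c_s = sum(a * b for _, _, a, b in source)  # assumed normalizer
    if lam_s and c_s > 0:
        total += lam_s / c_s * sum(
            a * b * log_p(theta, x, y) for x, y, a, b in source)

    if lam_tl and target_l:
        total += lam_tl / len(target_l) * sum(
            log_p(theta, x, y) for x, y in target_l)

    c_tu = sum(sum(g.values()) for _, g in target_u)  # assumed normalizer
    if lam_tu and c_tu > 0:
        total += lam_tu / c_tu * sum(
            g_y * log_p(theta, x, y)
            for x, g in target_u for y, g_y in g.items())

    if prior_var is not None:  # Gaussian prior on theta, up to a constant
        total -= sum(t * t for t in theta) / (2.0 * prior_var)
    return total

# Toy data (hypothetical).
theta = [0.5, -1.0]
source = [([1.0, 0.2], 1, 1.0, 1.0), ([1.0, 1.5], 0, 1.0, 1.0)]
target_l = [([1.0, -0.4], 1)]
target_u = [([1.0, 0.9], {0: 0.8, 1: 0.2})]

source_only = objective(theta, source, target_l, target_u, (1.0, 0.0, 0.0))
mixed = objective(theta, source, target_l, target_u, (0.4, 0.4, 0.2), prior_var=10.0)
```

Setting the weights to (1, 0, 0) with all source weights equal to 1 recovers ordinary supervised learning on the source data, which is one sense in which the setup covers standard methods.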

Page 20: Domain Adaptation for Biomedical Information Extraction

Feature selection

[Figure: source domain and target domain in feature space (each point represents a feature)]

Identify features that behave similarly across domains.

Page 21: Domain Adaptation for Biomedical Information Extraction

Feature selection

Observation: domain-specific features
  Fly genes: wingless, daughterless, eyeless, apexless
  “suffix -less” is weighted high in the model trained from fly data
  Useful for other organisms? In general NO!
  May cause generalizable features to be downweighted

Page 22: Domain Adaptation for Biomedical Information Extraction

Feature selection

Observation: generalizable features generalize well in all domains
  fly: “…decapentaplegic and wingless are expressed in analogous patterns in each…”
  mouse: “…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”

Page 23: Domain Adaptation for Biomedical Information Extraction

Feature selection

Observation: generalizable features generalize well in all domains
  fly: “…decapentaplegic and wingless are expressed in analogous patterns in each…”
  mouse: “…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”

“w_{i+2} = expressed” is generalizable
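The contrast between the two kinds of feature can be seen by extracting them for a gene token in each domain. This is only a sketch: the feature names (`w=`, `suffix=-`, `w+2=`) and the whitespace tokenization are assumptions, and the sentences are shortened versions of the ones on the slide.

```python
def token_features(tokens, i):
    """Return a set of string features for tokens[i]."""
    feats = set()
    word = tokens[i]
    feats.add("w=" + word.lower())
    # Character suffixes of length 3 and 4, e.g. "-less".
    for n in (3, 4):
        if len(word) > n:
            feats.add("suffix=-" + word[-n:].lower())
    # The word two positions to the right, e.g. "expressed".
    if i + 2 < len(tokens):
        feats.add("w+2=" + tokens[i + 2].lower())
    return feats

fly = "decapentaplegic and wingless are expressed in analogous patterns".split()
mouse = "that CD38 is expressed by both neurons and glial cells".split()

fly_feats = token_features(fly, fly.index("wingless"))
mouse_feats = token_features(mouse, mouse.index("CD38"))

# Features that fire for the gene token in both domains.
shared = fly_feats & mouse_feats
print(shared)
```

For these two tokens the only shared feature is the context word two positions to the right, while the "-less" suffix fires only for the fly gene.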

Page 24: Domain Adaptation for Biomedical Information Extraction

Feature selection

Intuition for identification of generalizable features

[Figure: features such as “-less” and “expressed” ranked 1 to 8 within each source domain: fly, mouse, D3, …, DK]
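One way to turn this intuition into a procedure is to rank features within each source domain and keep those that rank well everywhere. The slide does not spell out how the per-domain rankings are combined, so the sketch below makes two assumptions: features are ranked by absolute weight in a per-domain model, and rankings are combined by taking each feature's worst rank. The weights are made up for illustration.

```python
def rank_features(weights):
    """Map each feature to its rank (1 = largest absolute weight) in one domain."""
    ordered = sorted(weights, key=lambda f: (-abs(weights[f]), f))
    return {f: r for r, f in enumerate(ordered, start=1)}

def generalizable_features(domain_weights, h):
    """Pick the h features whose worst rank across all domains is best."""
    ranks = [rank_features(w) for w in domain_weights.values()]
    common = set.intersection(*(set(r) for r in ranks))
    worst = {f: max(r[f] for r in ranks) for f in common}
    return sorted(common, key=lambda f: (worst[f], f))[:h]

# Hypothetical per-domain feature weights.
domain_weights = {
    "fly":   {"suffix=-less": 2.1, "w+2=expressed": 1.7, "w-1=the": 0.2},
    "mouse": {"suffix=-less": 0.1, "w+2=expressed": 1.9, "w-1=the": 0.3},
}

selected = generalizable_features(domain_weights, h=1)
print(selected)
```

A feature that is ranked first in fly but last in mouse, such as the "-less" suffix here, loses to a feature that is ranked near the top in both.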

Page 25: Domain Adaptation for Biomedical Information Extraction

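Feature selection

Framework: matrix A is for feature selection

\[
(\hat{A}, \hat{v}, \{\hat{u}^k\}) = \arg\min_{A, v, \{u^k\}} \Big[
  \lambda \lVert v \rVert^2
  + \lambda_s \sum_{k=1}^{K} \lVert u^k \rVert^2
  - \frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \log p\big(y_i^k \mid x_i^k;\, A^T v + u^k\big)
\Big]
\]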

Page 26: Domain Adaptation for Biomedical Information Extraction


Feature selection results on gene/protein recognition

Page 27: Domain Adaptation for Biomedical Information Extraction

New directions to explore

Knowledge bases
Expert supervision

Page 28: Domain Adaptation for Biomedical Information Extraction

Knowledge bases – entity recognition

Well-documented nomenclatures
  Fly, Mouse, Rat
  Help filter out false positives?
  Help select features?
Dictionaries of entities
  “Dictionary features”
  Automatic summarization of nomenclatures?
  Automatic identification of good features?
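As an illustration of what a "dictionary feature" might look like, the sketch below marks every token that is covered by an entry in an entity dictionary, including multi-word entries. The dictionary contents, the feature name `in_dict`, and the `max_len` limit are assumptions made for the example.

```python
def dictionary_features(tokens, dictionary, max_len=3):
    """Return one feature set per token; tokens covered by a dictionary
    entry (of up to max_len words) get the feature "in_dict"."""
    feats = [set() for _ in tokens]
    lowered = [t.lower() for t in tokens]
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            end = start + length
            if end > len(tokens):
                break
            if " ".join(lowered[start:end]) in dictionary:
                for i in range(start, end):
                    feats[i].add("in_dict")
    return feats

# Hypothetical dictionary, stored in lowercase.
gene_dictionary = {"wingless", "decapentaplegic", "cytochrome p450"}

single = dictionary_features(
    "decapentaplegic and wingless are expressed".split(), gene_dictionary)
multi = dictionary_features(
    "Cytochrome P450 inhibition".split(), gene_dictionary)
```

The resulting feature sets can be merged with the other per-token features before training, so the classifier learns how much to trust the dictionary rather than treating a match as a final decision.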

Page 29: Domain Adaptation for Biomedical Information Extraction

Knowledge bases – sentence categorization in gene summarizer

For fly, the training sentences are automatically extracted from FlyBase.
For other organisms, do we have similar resources?

Page 30: Domain Adaptation for Biomedical Information Extraction

Expert supervision – entity recognition

Computer system selects ambiguous examples for human experts to judge.
Computer system asks human experts other questions.
  Similar organisms?
  Typical surface features? (e.g. cis-regulatory elements, “-RE”)
Computer system summarizes possible features from pseudo labeled data, and asks human experts for confirmation.
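For the first item, one common way to pick ambiguous examples is uncertainty sampling: send the expert the examples whose predicted probability is closest to 0.5. The slides do not name a selection rule, so this choice and the probabilities below are assumptions.

```python
def most_ambiguous(probabilities, k):
    """Return the indices of the k examples whose predicted probability of
    being an entity is closest to 0.5 (ties broken by index)."""
    order = sorted(
        range(len(probabilities)),
        key=lambda i: (abs(probabilities[i] - 0.5), i),
    )
    return order[:k]

# Hypothetical classifier outputs for five candidate mentions.
probs = [0.98, 0.52, 0.10, 0.45, 0.71]

to_label = most_ambiguous(probs, k=2)
print(to_label)
```

Examples the classifier is already sure about (0.98 or 0.10 here) are skipped, so the expert's time goes to the cases where a judgment changes the model most.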

Page 31: Domain Adaptation for Biomedical Information Extraction


Connections to BeeSpace V4

A major challenge in BeeSpace V4 is extraction of new types of entities and relations.

Exploiting knowledge bases and expert supervision is especially important.

For new types, no labeled data is available even from other domains. Use of bootstrapping methods should be explored.

Page 32: Domain Adaptation for Biomedical Information Extraction


New entity types

Recognition of many new types will be dictionary based: organism, anatomy, biological process, etc.

Recognition of some new types will need some NER techniques: chemical, regulatory element

Page 33: Domain Adaptation for Biomedical Information Extraction

New relation types

Bootstrapping (?)
  Seed patterns from knowledge bases or human experts
  Human inspection of newly discovered patterns?
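A toy sketch of how such a bootstrapping loop could work, under strong simplifying assumptions: each sentence is already reduced to a triple (entity 1, text between the entities, entity 2), a pattern is just the text in between, and the `approve` callback stands in for human inspection. The entity names and sentences are invented.

```python
def bootstrap(sentences, seed_patterns, approve=lambda pattern: True, max_rounds=5):
    """Alternate between finding entity pairs with known patterns and
    finding new patterns that connect known pairs."""
    patterns = set(seed_patterns)
    for _ in range(max_rounds):
        # Pairs matched by any pattern accepted so far.
        pairs = {(a, b) for a, mid, b in sentences if mid in patterns}
        # New patterns that connect an already extracted pair.
        candidates = {mid for a, mid, b in sentences if (a, b) in pairs} - patterns
        accepted = {p for p in candidates if approve(p)}
        if not accepted:
            break
        patterns |= accepted
    pairs = {(a, b) for a, mid, b in sentences if mid in patterns}
    return patterns, pairs

# Hypothetical pre-processed sentences.
sentences = [
    ("geneA", "regulates", "geneB"),
    ("geneA", "controls the expression of", "geneB"),
    ("geneC", "controls the expression of", "geneD"),
    ("geneE", "is located near", "geneF"),
]

patterns, pairs = bootstrap(sentences, {"regulates"})
strict_patterns, strict_pairs = bootstrap(
    sentences, {"regulates"}, approve=lambda pattern: False)
```

With every candidate approved, the seed pattern leads to a second pattern and then to a second entity pair; with every candidate rejected, the output stays at what the seed alone can find.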

Page 34: Domain Adaptation for Biomedical Information Extraction


The end