
A Two-Stage Approach to Domain Adaptation for Statistical Classifiers

Jing Jiang & ChengXiang Zhai

Department of Computer Science, University of Illinois at Urbana-Champaign

Nov 7, 2007, CIKM 2007

What is domain adaptation?

Example: named entity recognition

Train (labeled): New York Times → NER classifier → Test (unlabeled): New York Times

Target entities: persons, locations, organizations, etc.

Standard supervised learning: 85.5%

Example: named entity recognition

Train (labeled): Reuters (labeled New York Times data not available) → NER classifier → Test (unlabeled): New York Times

Target entities: persons, locations, organizations, etc.

Non-standard (realistic) setting: 64.1%

Domain difference → performance drop

train: New York Times, test: New York Times (ideal setting): 85.5%
train: Reuters, test: New York Times (realistic setting): 64.1%

Another NER example

Gene name recognizer, train: mouse, test: mouse (ideal setting): 54.1%
Gene name recognizer, train: fly, test: mouse (realistic setting): 28.1%

Other examples

Spam filtering: public email collection → personal inboxes
Sentiment analysis of product reviews: digital cameras → cell phones; movies → books

Can we do better than standard supervised learning?

Domain adaptation: to design learning methods that are aware of the difference between the training and test domains.


How do we solve the problem in general?


Observation 1: domain-specific features

Fly gene names: wingless, daughterless, eyeless, apexless

• describing phenotype
• in fly gene nomenclature
• feature "-less" weighted high

Other organisms' gene names: CD38, PABPC5

Is the "-less" feature still useful for other organisms? No!


Observation 2: generalizable features

…decapentaplegic and wingless are expressed in analogous patterns in each…

…that CD38 is expressed by both neurons and glial cells… …that PABPC5 is expressed in fetal brain and in a range of adult tissues.

feature "X be expressed"

General idea: two-stage approach

[diagram: the feature space is shared by the source domains and the target domain; some features are generalizable across domains, the rest are domain-specific]

Goal

[feature-space diagram spanning the source and target domains]

Regular classification

[feature-space diagram]

Generalization (Stage 1): to emphasize generalizable features in the trained model

[feature-space diagram]

Adaptation (Stage 2): to pick up domain-specific features for the target domain

[feature-space diagram]

Regular semi-supervised learning

[feature-space diagram]

Comparison with related work

• We explicitly model generalizable features; previous work models them implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
• We do not need labeled target data, but we do need multiple source (training) domains; some work requires labeled target data [Daumé III 2007].
• We have a 2nd stage of adaptation, which uses semi-supervised learning; previous work does not incorporate semi-supervised learning [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].

Implementation of the two-stage approach with logistic regression classifiers

Logistic regression classifiers

An instance, e.g. "…and wingless are expressed in…", is encoded as a vector x of p binary features (such as "-less" and "X be expressed"); each class y has a weight vector w_y, and

p(y | x; w) = exp(w_y^T x) / Σ_{y'} exp(w_{y'}^T x)
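As a concrete sketch of the conditional model above (not the paper's code; the feature values and class names below are made up for illustration):

```python
import numpy as np

def predict_proba(x, W):
    """p(y | x; w) = exp(w_y^T x) / sum_y' exp(w_y'^T x).

    x: (p,) binary feature vector; W: (num_classes, p) matrix holding
    one weight vector w_y per class y."""
    scores = W @ x          # w_y^T x for every class y
    scores -= scores.max()  # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

# Toy example: 3 binary features (think "-less", "X be expressed", etc.)
# and 2 classes ("gene" vs. "other"); all numbers are illustrative.
x = np.array([1.0, 1.0, 0.0])
W = np.array([[2.1, 3.0, -0.3],   # class "gene"
              [0.2, 0.4,  0.5]])  # class "other"
p = predict_proba(x, W)           # p[0] > p[1]: "gene" dominates here
```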

Learning a logistic regression classifier

ŵ = argmin_w [ λ ||w||^2 - (1/N) Σ_{i=1}^N log ( exp(w_{y_i}^T x_i) / Σ_{y'} exp(w_{y'}^T x_i) ) ]

• the sum is the log likelihood of the training data
• λ ||w||^2 is the regularization term: penalize large weights, control model complexity
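A minimal gradient-descent sketch of this regularized objective (the learning rate, λ, step count, and toy data are assumptions, not values from the paper):

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def fit_logreg(X, y, num_classes, lam=0.1, lr=0.5, steps=500):
    """Gradient descent on  lam*||W||^2 - (1/N) sum_i log p(y_i | x_i; W)."""
    N, p = X.shape
    W = np.zeros((num_classes, p))
    Y = np.eye(num_classes)[y]                   # one-hot labels
    for _ in range(steps):
        P = softmax(X @ W.T)                     # (N, num_classes)
        grad = (P - Y).T @ X / N + 2 * lam * W   # gradient of the objective
        W -= lr * grad
    return W

# Toy data (illustrative): feature 0 indicates class 1, feature 1 class 0.
X = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
y = np.array([1, 1, 0, 0])
W = fit_logreg(X, y, num_classes=2)
```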

Generalizable features in weight vectors

Train a separate classifier on each of the K source domains D1, …, DK, giving weight vectors w1, …, wK. Features that receive high weights in all K vectors are generalizable; features weighted high in only some domains are domain-specific.

We want to decompose w in this way

w = (a component with non-zero entries only at the h generalizable features) + (a component carrying the domain-specific weights)

Feature selection matrix A

A is an h × p 0/1 matrix, with one 1 per row, that selects the h generalizable features from the p-dimensional feature vector x:

z = A x
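A small illustration of A and z = Ax (the feature indices chosen here are hypothetical):

```python
import numpy as np

def selection_matrix(generalizable_idx, p):
    """Build the h-by-p 0/1 matrix A with one 1 per row, so that
    z = A @ x keeps only the h generalizable coordinates of x."""
    h = len(generalizable_idx)
    A = np.zeros((h, p))
    A[np.arange(h), generalizable_idx] = 1.0
    return A

# Suppose features 2 and 4 of a 6-dim binary vector are generalizable.
x = np.array([0., 1., 1., 0., 1., 0.])
A = selection_matrix([2, 4], p=6)
z = A @ x   # -> array([1., 1.]), i.e. (x[2], x[4])
```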

Decomposition of w

w^T x = v^T z + u^T x

where v holds the weights for the h generalizable features (applied to z = A x) and u holds the weights for the domain-specific features.

Decomposition of w

w^T x = v^T z + u^T x
      = v^T A x + u^T x
      = (A^T v)^T x + u^T x

⟹ w = A^T v + u
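The identity w = A^T v + u can be checked numerically; this sketch uses random vectors and an assumed 2-of-6 selection matrix:

```python
import numpy as np

# Check w^T x = v^T (A x) + u^T x for w = A^T v + u on random vectors.
rng = np.random.default_rng(0)
p, h = 6, 2
A = np.zeros((h, p))
A[0, 2] = 1.0          # selects feature 2
A[1, 4] = 1.0          # selects feature 4
v = rng.normal(size=h) # weights for generalizable features
u = rng.normal(size=p) # weights for domain-specific features
x = rng.normal(size=p)

w = A.T @ v + u
assert np.isclose(w @ x, v @ (A @ x) + u @ x)   # the two scorings agree
```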

Decomposition of w

w = A^T v + u, where A^T v is shared by all domains and u is domain-specific.

Framework for generalization

Fix A, optimize:

(v̂, {û_k}) = argmin_{v, {u_k}} [ λ_s Σ_{k=1}^K ||u_k||^2 + ||v||^2
    - (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; v, u_k, A) ]

• the log term is the likelihood of the labeled data from the K source domains, with domain k scored by w_k = A^T v + u_k
• λ_s >> 1: to penalize domain-specific features
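A sketch of evaluating this objective for a binary classifier (the paper uses multiclass logistic regression; the sigmoid simplification, the λ_s value, and the toy data below are all assumptions):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def generalization_objective(v, U, A, domains, lam_s=10.0):
    """Stage-1 objective with A fixed (binary-classification sketch):
    lam_s * sum_k ||u_k||^2 + ||v||^2
      - (1/K) sum_k (1/N_k) sum_i log p(y_i^k | x_i^k),
    where domain k is scored with w_k = A^T v + u_k.
    `domains` is a list of (X, y) pairs; `U` the per-domain u_k vectors."""
    K = len(domains)
    penalty = lam_s * sum(float(u @ u) for u in U) + float(v @ v)
    loglik = 0.0
    for (X, y), u in zip(domains, U):
        w = A.T @ v + u
        p1 = sigmoid(X @ w)                      # p(y = 1 | x)
        loglik += np.log(np.where(y == 1, p1, 1.0 - p1)).mean()
    return penalty - loglik / K

# Sanity check: with v = 0 and all u_k = 0, every p(y|x) = 0.5,
# so the objective reduces to -log(0.5) = log 2.
A = np.zeros((2, 4)); A[0, 0] = 1.0; A[1, 2] = 1.0
Xs = np.array([[1., 0., 0., 0.], [0., 0., 1., 0.]])
ys = np.array([1, 0])
domains = [(Xs, ys) for _ in range(3)]
U = [np.zeros(4) for _ in range(3)]
obj = generalization_objective(np.zeros(2), U, A, domains)
```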

Framework for adaptation

Fix A, optimize:

(v̂, û_t, {û_k}) = argmin_{v, u_t, {u_k}} [ λ_s Σ_{k=1}^K ||u_k||^2 + λ_t ||u_t||^2 + ||v||^2
    - (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; v, u_k, A)
    - (1/m) Σ_{i=1}^m log p(y_i^t | x_i^t; v, u_t, A) ]

• the last term is the likelihood of m pseudo-labeled target-domain examples
• λ_t = 1 << λ_s: to pick up domain-specific features in the target domain
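Extending the same binary sketch with the target-domain term (again an assumption-laden illustration, not the paper's implementation):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def domain_loglik(X, y, w):
    """Mean log likelihood of (X, y) under a binary logistic model w."""
    p1 = sigmoid(X @ w)
    return np.log(np.where(y == 1, p1, 1.0 - p1)).mean()

def adaptation_objective(v, U, u_t, A, domains, target, lam_s=10.0, lam_t=1.0):
    """Stage-2 objective with A fixed: the source terms of stage 1 plus
    lam_t*||u_t||^2 and the likelihood of the pseudo-labeled target
    examples in `target`, scored with the target weights w_t = A^T v + u_t."""
    K = len(domains)
    penalty = (lam_s * sum(float(u @ u) for u in U)
               + lam_t * float(u_t @ u_t) + float(v @ v))
    src = sum(domain_loglik(X, y, A.T @ v + u)
              for (X, y), u in zip(domains, U)) / K
    Xt, yt = target
    tgt = domain_loglik(Xt, yt, A.T @ v + u_t)
    return penalty - src - tgt

# Sanity check: with all parameters zero, both likelihood terms are -log 2,
# so the objective reduces to 2*log 2.
A = np.zeros((2, 4)); A[0, 0] = 1.0; A[1, 2] = 1.0
Xs = np.array([[1., 0., 0., 0.], [0., 0., 1., 0.]])
ys = np.array([1, 0])
domains = [(Xs, ys) for _ in range(3)]
U = [np.zeros(4) for _ in range(3)]
obj = adaptation_objective(np.zeros(2), U, np.zeros(4), A, domains, (Xs, ys))
```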

How to find A? (1)

Joint optimization: also optimize over A,

(Â, v̂, {û_k}) = argmin_{A, v, {u_k}} [ λ_s Σ_{k=1}^K ||u_k||^2 + ||v||^2
    - (1/K) Σ_{k=1}^K (1/N_k) Σ_{i=1}^{N_k} log p(y_i^k | x_i^k; v, u_k, A) ]

solved by alternating optimization.

How to find A? (2)

Domain cross validation. Idea: train on (K - 1) source domains and test on the held-out source domain. Approximation:

• w_f^k: weight for feature f learned from domain k
• w̄_f^k: weight for feature f learned from the other domains
• rank features by Σ_{k=1}^K w_f^k · w̄_f^k

See paper for details.
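One way to sketch this ranking (approximating w̄_f^k by the mean weight over the other K - 1 domains; the exact criterion is in the paper, and the toy weights below are illustrative):

```python
import numpy as np

def rank_generalizable(weights):
    """weights: (K, p) array with weights[k, f] = w_f^k, the weight of
    feature f in a model trained on source domain k. Approximate w̄_f^k
    by the mean weight over the other K-1 domains and score feature f by
    sum_k w_f^k * w̄_f^k; high scores = consistently useful features."""
    K, p = weights.shape
    totals = weights.sum(axis=0)                      # (p,)
    held_out = (totals[None, :] - weights) / (K - 1)  # mean over other domains
    scores = (weights * held_out).sum(axis=0)
    return np.argsort(-scores)                        # features, best first

# Toy: feature 0 ("expressed"-like) high in all 3 domains;
# feature 1 ("-less"-like) high in only one domain.
W = np.array([[1.5, 0.05],
              [2.0, 0.10],
              [1.8, 1.20]])
order = rank_generalizable(W)   # feature 0 ranks first
```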

Intuition for domain cross validation

[figure: feature weights learned from different subsets of the domains D1, D2, …, Dk-1, Dk (fly). "expressed" receives a high weight whether or not the fly domain is held out (e.g., 1.5, 1.2, 1.8), while "-less" is weighted high (2.0) only when training includes the fly domain and low otherwise (0.05, 0.1): "expressed" generalizes, "-less" is fly-specific]

Experiments

Data set:
• BioCreative Challenge Task 1B
• gene/protein name recognition
• 3 organisms/domains: fly, mouse, and yeast

Experiment setup:
• 2 organisms for training, 1 for testing
• F1 as performance measure

Experiments: Generalization (F: fly, M: mouse, Y: yeast)

Method            F+M→Y   M+Y→F   Y+F→M
BL                0.633   0.129   0.416
DA-1 (joint-opt)  0.627   0.153   0.425
DA-2 (domain CV)  0.654   0.195   0.470

• using generalizable features is effective
• domain cross validation is more effective than joint optimization

Experiments: Adaptation (F: fly, M: mouse, Y: yeast)

Method      F+M→Y   M+Y→F   Y+F→M
BL-SSL      0.633   0.241   0.458
DA-2-SSL    0.759   0.305   0.501

• domain-adaptive bootstrapping is more effective than regular bootstrapping

Experiments: Adaptation

domain-adaptive SSL is more effective, especially with a small number of pseudo labels


Conclusions and future work

Two-stage domain adaptation:
• Generalization: outperformed standard supervised learning
• Adaptation: outperformed standard bootstrapping

Two ways to find generalizable features:
• domain cross validation is more effective

Future work:
• single source domain?
• setting the parameters h and m

References

S. Ben-David, J. Blitzer, K. Crammer & F. Pereira. Analysis of representations for domain adaptation. NIPS 2007.

J. Blitzer, R. McDonald & F. Pereira. Domain adaptation with structural correspondence learning. EMNLP 2006.

H. Daumé III. Frustratingly easy domain adaptation. ACL 2007.


Thank you!