
Page 1: Domain Adaptation with Multiple Sources

Domain Adaptation with Multiple Sources

Yishay Mansour, Tel Aviv Univ. & Google

Mehryar Mohri, NYU & Google

Afshin Rostami, NYU


Page 3: Domain Adaptation with Multiple Sources

Adaptation – motivation

• High level:
– The ability to generalize from one domain to another
• Significance:
– Basic human property
– Essential in most learning environments
– Implicit in many applications

Page 4: Domain Adaptation with Multiple Sources

Adaptation – examples

• Sentiment analysis:
– Users leave reviews on products, sellers, movies, …
– Goal: score reviews as positive or negative
– Adaptation example:
• Learn for restaurants and airlines
• Generalize to hotels

Page 5: Domain Adaptation with Multiple Sources

Adaptation – examples

• Speech recognition
– Adaptation:
• Learn a few accents
• Generalize to new accents
– think “foreign accents”

Page 6: Domain Adaptation with Multiple Sources

Adaptation and generalization

• Machine learning prediction:
– Learn from examples drawn from a distribution D
– Predict the labels of unseen examples drawn from the same distribution D
– Generalization within a distribution
• Adaptation:
– Predict the labels of unseen examples drawn from a different distribution D’
– Generalization across distributions

Page 7: Domain Adaptation with Multiple Sources

Adaptation – Related Work

• Learn from D and test on D’:
– relating the increase in error to dist(D, D’)
– Ben-David et al. (2006), Blitzer et al. (2007)
• Single distribution with varying label quality:
– Crammer et al. (2005, 2006)


Page 9: Domain Adaptation with Multiple Sources

Our Model – input

• Target function: f
• Distributions: D1, …, Dk
• Hypotheses: h1, …, hk
• For every i: L(Di, hi, f) ≤ ε

• Expected loss:
– typical loss function: L(a,b) = |a−b| and L(D,h,f) = Ex~D[ |f(x)−h(x)| ]
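To make the model concrete, here is a minimal Python sketch (an illustration of mine, not part of the talk) of the expected loss L(D,h,f) = Ex~D[ |f(x)−h(x)| ] over a finite sample space, with a distribution represented as a dict from points to probabilities:

def expected_loss(D, h, f):
    # D: dict mapping each point x to its probability D(x)
    # h, f: functions from points to real-valued predictions / labels
    return sum(p * abs(f(x) - h(x)) for x, p in D.items())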

Page 10: Domain Adaptation with Multiple Sources

Our Model – target distribution

• Basic distributions: D1, …, Dk
• Target distribution: a mixture Dλ with weights λ1, …, λk:

Dλ(x) = Σi λi Di(x)
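Continuing the same hypothetical sketch, the mixture distribution can be formed pointwise from the basic distributions:

def mixture(Ds, lam):
    # Ds: list of k distributions (dicts); lam: k mixture weights summing to 1
    points = set().union(*Ds)
    return {x: sum(l * D.get(x, 0.0) for l, D in zip(lam, Ds)) for x in points}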

Page 11: Domain Adaptation with Multiple Sources

Our model – Combination Rule

• Combine h1, …, hk into a hypothesis h*
– low expected loss: hopefully at most ε
• Combining rules: let z satisfy Σi zi = 1 and zi ≥ 0
– linear: h*(x) = Σi zi hi(x)
– distribution weighted:

hz(x) = Σi [ zi Di(x) / Σj zj Dj(x) ] hi(x)
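Both combining rules are short in the same sketch (linear_rule and dist_weighted_rule are hypothetical names; the distribution weighted rule is only defined where Σj zj Dj(x) > 0, and the sketch returns an arbitrary 0 elsewhere):

def linear_rule(hs, z):
    # h*(x) = sum_i z_i h_i(x)
    return lambda x: sum(zi * h(x) for zi, h in zip(z, hs))

def dist_weighted_rule(hs, Ds, z):
    # h_z(x) = sum_i [z_i D_i(x) / sum_j z_j D_j(x)] h_i(x)
    def h_z(x):
        denom = sum(zj * D.get(x, 0.0) for zj, D in zip(z, Ds))
        if denom == 0.0:
            return 0.0  # arbitrary value outside the support of the z-weighted sources
        num = sum(zi * D.get(x, 0.0) * h(x) for zi, D, h in zip(z, Ds, hs))
        return num / denom
    return h_z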

Page 12: Domain Adaptation with Multiple Sources

Combining Rules – Pros

• Alternative: build a dataset for the mixture
– Learning the mixture parameters is non-trivial
– The combined dataset might be huge
– Domain-dependent data may be unavailable
• Sometimes only classifiers are given/exist
– privacy
• MOST IMPORTANT: FUNDAMENTAL THEORY QUESTION

Page 13: Domain Adaptation with Multiple Sources

Our Results:

• Linear combining rule:
– Seems like the first thing to try
– Can be very bad
• Simple settings where any linear combining rule performs badly

Page 14: Domain Adaptation with Multiple Sources

Our Results:

• Distribution weighted combining rules:
– Given the mixture parameter λ:
• there is a good distribution weighted combining rule
• expected loss at most ε
– For any target function f:
• there is a good distribution weighted combining rule hz
• expected loss at most ε
– Extension to multiple “consistent” target functions:
• expected loss at most 3ε
• OUTCOME: this is the “right” hypothesis class


Page 16: Domain Adaptation with Multiple Sources

Linear combining rules

x:  f(x)  h1(x)  h0(x)
a:   1     1      0
b:   0     1      0

x:  D(x)  Da(x)  Db(x)
a:   ½     1      0
b:   ½     0      1

Original loss: ε = 0 !!!

Any linear combining rule h has expected absolute loss ½
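A quick numeric check of this example, continuing the sketch above (the printed values follow from the tables):

# Point-mass source distributions, the target, and the two constant hypotheses
Da, Db = {'a': 1.0, 'b': 0.0}, {'a': 0.0, 'b': 1.0}
f = lambda x: {'a': 1, 'b': 0}[x]   # target function
h1 = lambda x: 1                    # perfect on Da
h0 = lambda x: 0                    # perfect on Db

D = mixture([Da, Db], [0.5, 0.5])   # the uniform target mixture
print(expected_loss(Da, h1, f), expected_loss(Db, h0, f))  # 0.0 0.0, so eps = 0
for t in (0.0, 0.25, 0.5, 1.0):
    h_lin = linear_rule([h1, h0], [t, 1 - t])
    print(t, expected_loss(D, h_lin, f))                   # 0.5 every time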

Page 17: Domain Adaptation with Multiple Sources

Distribution weighted combining rule

• Target distribution – a mixture: Dλ(x) = Σi λi Di(x)
• Set z = λ:

hλ(x) = Σi [ λi Di(x) / Dλ(x) ] hi(x) = Σi [ λi Di(x) / Σj λj Dj(x) ] hi(x)

• Claim: L(Dλ, hλ, f) ≤ ε

Page 18: Domain Adaptation with Multiple Sources

Distribution weighted combining rule

PROOF:

L(Dλ, hλ, f) = Σx Dλ(x) L(hλ(x), f(x))
≤ Σx Dλ(x) Σi [ λi Di(x) / Dλ(x) ] L(hi(x), f(x))    (convexity of L)
= Σi λi Σx Di(x) L(hi(x), f(x))
= Σi λi L(Di, hi, f)
≤ Σi λi ε = ε

Page 19: Domain Adaptation with Multiple Sources

Back to the bad example

x:  f(x)  h1(x)  h0(x)
a:   1     1      0
b:   0     1      0

x:  D(x)  Da(x)  Db(x)
a:   ½     1      0
b:   ½     0      1

Original loss: ε = 0 !!!

With z = λ = (½, ½):
x = a: hλ(a) = h1(a) = 1
x = b: hλ(b) = h0(b) = 0
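Continuing the numeric sketch, the distribution weighted rule with z = λ = (½, ½) recovers the perfect predictions:

h_dw = dist_weighted_rule([h1, h0], [Da, Db], [0.5, 0.5])
print(h_dw('a'), h_dw('b'))        # 1.0 0.0
print(expected_loss(D, h_dw, f))   # 0.0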


Page 21: Domain Adaptation with Multiple Sources

Unknown mixture distribution

• Zero-sum game:
– NATURE selects a distribution Di
– LEARNER selects a z, i.e. a hypothesis hz
– Payoff: L(Di, hz, f)
• Restating the previous result:
– For any mixed action λ of NATURE
– LEARNER has a pure action z = λ
• such that the expected loss is at most ε

Page 22: Domain Adaptation with Multiple Sources

Unknown mixture distribution

• Consequence:
– LEARNER has a mixed action (over z’s)
– for any mixed action λ of NATURE
• i.e., a mixture distribution Dλ
– the loss is at most ε
• Challenge:
– show a specific hypothesis hz
• a pure, not mixed, action

Page 23: Domain Adaptation with Multiple Sources

Searching for a good hypothesis

• Uniformly good hypothesis hz:
– for any Di we have L(Di, hz, f) ≤ ε
• Assume all the hi are identical:
– extremely lucky and unlikely case
• If we have a uniformly good hypothesis we are done:
– L(Dλ, hz, f) = Σi λi L(Di, hz, f) ≤ Σi λi ε = ε
• We need to show a good hz in general!

Page 24: Domain Adaptation with Multiple Sources

Proof Outline:

• Balancing the losses:
– Show that some hz has identical loss on every Di
– uses the Brouwer Fixed Point Theorem
• holds very generally
• Bounding the losses:
– Show this hz has low loss on some mixture
• specifically Dz

Page 25: Domain Adaptation with Multiple Sources

Brouwer Fixed Point Theorem: for any convex and compact set A and any continuous mapping φ: A → A, there exists a point x in A such that φ(x) = x.

Page 26: Domain Adaptation with Multiple Sources

Balancing Losses

• The simplex A = { z : Σi zi = 1 and zi ≥ 0 }
• The mapping φ:

[φ(z)]i = zi L(Di, hz, f) / Σj zj L(Dj, hz, f)

Problem 1: need φ to be continuous

Page 27: Domain Adaptation with Multiple Sources

Balancing Losses

• A = { z : Σi zi = 1 and zi ≥ 0 }

[φ(z)]i = zi L(Di, hz, f) / Σj zj L(Dj, hz, f)

• Fixed point z = φ(z):

zi = zi L(Di, hz, f) / Σj zj L(Dj, hz, f)

so, for every i with zi ≠ 0,

L(Di, hz, f) = Σj zj L(Dj, hz, f)

Problem 2: needs zi ≠ 0

Page 28: Domain Adaptation with Multiple Sources

Bounding the losses

• We can guarantee balanced losses even for a linear combining rule!

x:  f(x)  h1(x)  h0(x)
a:   1     1      0
b:   0     1      0

x:  D(x)  Da(x)  Db(x)
a:   ½     1      0
b:   ½     0      1

For z = (½, ½) we have L(Da, hz, f) = ½ and L(Db, hz, f) = ½
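In the sketch this balanced-but-bad linear rule is a one-line check:

h_half = linear_rule([h1, h0], [0.5, 0.5])   # predicts 1/2 everywhere
print(expected_loss(Da, h_half, f), expected_loss(Db, h_half, f))  # 0.5 0.5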

Page 29: Domain Adaptation with Multiple Sources

Bounding Losses

• Consider the previous z
– from the Brouwer fixed point theorem
• Consider the mixture Dz:
– expected loss is at most ε
• Also: L(Dz, hz, f) = Σj zj L(Dj, hz, f) = γ
• Conclusion:
– for any mixture, expected loss at most γ ≤ ε

Page 30: Domain Adaptation with Multiple Sources

Solving the problems:

• Redefine the distribution weighted rule:

hz,η(x) = Σi [ (zi Di(x) + η/k) / (Σj zj Dj(x) + η) ] hi(x)

• Claim: for any distribution D, L(D, hz,η, f) is continuous in z
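In the sketch, the smoothed rule adds η/k to every weight; the weights then stay strictly positive, which is what makes z ↦ L(D, hz,η, f) continuous (smoothed_rule is a hypothetical name):

def smoothed_rule(hs, Ds, z, eta):
    # h_{z,eta}(x) = sum_i [(z_i D_i(x) + eta/k) / (sum_j z_j D_j(x) + eta)] h_i(x)
    k = len(hs)
    def h(x):
        denom = sum(zj * D.get(x, 0.0) for zj, D in zip(z, Ds)) + eta
        num = sum((zi * D.get(x, 0.0) + eta / k) * hi(x)
                  for zi, D, hi in zip(z, Ds, hs))
        return num / denom
    return h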

Page 31: Domain Adaptation with Multiple Sources

Main Theorem

For any target function f and any δ > 0, there exist η > 0 and z such that for any λ we have

L(Dλ, hz,η, f) ≤ ε + δ

Page 32: Domain Adaptation with Multiple Sources

Balancing Losses

• The set A = { z : Σi zi = 1 and zi ≥ 0 }
– the simplex
• The mapping φ with parameters η and η’:
– [φ(z)]i = (zi Li,z + η’/k) / (Σj zj Lj,z + η’)
• where Li,z = L(Di, hz,η, f)
• For some z in A we have φ(z) = z:
– zi = (zi Li,z + η’/k) / (Σj zj Lj,z + η’) > 0
– hence Li,z = (Σj zj Lj,z) + η’ − η’/(zi k) < (Σj zj Lj,z) + η’
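The existence argument via Brouwer is nonconstructive, but one can iterate φ heuristically to look for an approximately balancing z. This sketch is an assumption of mine, not an algorithm from the talk, and the iteration has no convergence guarantee:

def balance(hs, Ds, f, eta=1e-3, eta2=1e-3, steps=2000):
    # Iterate [phi(z)]_i = (z_i L_{i,z} + eta2/k) / (sum_j z_j L_{j,z} + eta2),
    # where L_{i,z} = L(D_i, h_{z,eta}, f); eta2 plays the role of eta'
    k = len(hs)
    z = [1.0 / k] * k
    for _ in range(steps):
        h = smoothed_rule(hs, Ds, z, eta)
        L = [expected_loss(D, h, f) for D in Ds]
        denom = sum(zj * Lj for zj, Lj in zip(z, L)) + eta2
        z = [(zi * Li + eta2 / k) / denom for zi, Li in zip(z, L)]
    return z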

Page 33: Domain Adaptation with Multiple Sources

Bounding Losses

• Consider the previous z
– from the Brouwer fixed point theorem
• Consider the mixture Dz:
– expected loss is at most ε + η
• By definition, Σj zj Lj,z = L(Dz, hz,η, f)
• Conclusion: γ = Σj zj Lj,z ≤ ε + η

Page 34: Domain Adaptation with Multiple Sources

Putting it together

• There exists (z, η) such that:
– the expected losses of hz,η are approximately balanced
– L(Di, hz,η, f) ≤ γ + η’
• Bounding γ using Dz:
– γ = L(Dz, hz,η, f) ≤ ε + η
• For any mixture Dλ:
– L(Dλ, hz,η, f) ≤ ε + η + η’

Page 35: Domain Adaptation with Multiple Sources

A more general model

• So far: NATURE first fixes the target function f
• Consistent target functions f:
– the expected loss w.r.t. each Di is at most ε
• for each of the k distributions
• Function class F = { f : f is consistent }
• New model:
– LEARNER picks a hypothesis h
– NATURE picks f in F and a mixture Dλ
– Loss: L(Dλ, h, f)
• RESULT: L(Dλ, h, f) ≤ 3ε


Page 37: Domain Adaptation with Multiple Sources

Uniform Algorithm

• Hypothesis: set z = (1/k, …, 1/k):

hu(x) = Σi [ (1/k) Di(x) / Σj (1/k) Dj(x) ] hi(x) = Σi [ Di(x) / Σj Dj(x) ] hi(x)

• Performance:
– For any mixture, expected error ≤ kε
– There exists a mixture with expected error Ω(kε)
– For k = 2, there exists a mixture with error 2ε − ε²
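In the sketch, the uniform rule is just the distribution weighted rule at the center of the simplex:

def uniform_rule(hs, Ds):
    # h_u(x) = sum_i [D_i(x) / sum_j D_j(x)] h_i(x)
    k = len(hs)
    return dist_weighted_rule(hs, Ds, [1.0 / k] * k)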

Page 38: Domain Adaptation with Multiple Sources

Open Problem

• Find a uniformly good hypothesis
– efficiently!
• Algorithmic issues:
– search over the z’s
– multiple local minima


Page 40: Domain Adaptation with Multiple Sources

Empirical Results

• Sentiment analysis dataset – sample reviews (verbatim):
– good product takes a little time to start operating very good for the price a little trouble using it inside ca
– it rocks man this is the rockinest think i've ever seen or buyed dudes check it ou
– does not retract agree with the prior reviewers i can not get it to retract any longer and that was only after 3 uses
– dont buy not worth a cent got it at walmart can't even remove a scuff i give it 100 good thing i could return it
– flash drive excelent hard drive good price and good time for seller thanks

Page 41: Domain Adaptation with Multiple Sources

Empirical analysis

• Multiple domains:
– dvd, books, electronics, kitchen appliances
• Language model:
– build a model for each domain
• unlike the theory, this is an additional error source
• Tested on a mixture distribution
– known mixture parameters
• Target: score (1–5)
– error: Mean Squared Error (MSE)

Page 42: Domain Adaptation with Multiple Sources

[Figure: MSE of the linear vs. distribution weighted combining rules on mixtures of the books, dvd, electronics, and kitchen domains]


Page 46: Domain Adaptation with Multiple Sources

Summary

• Adaptation model
– combining rules:
• linear
• distribution weighted
• Theoretical analysis
– mixture distributions
• Future research
– algorithms for combining rules
– beyond mixtures


Page 48: Domain Adaptation with Multiple Sources

Adaptation – Our Model

• Input:
– target function: f
– k distributions: D1, …, Dk
– k hypotheses: h1, …, hk
– For every i: L(Di, hi, f) ≤ ε
• where L(D, h, f) denotes the expected loss
– think L(D, h, f) = Ex~D[ |f(x)−h(x)| ]