Domain Adaptation
DESCRIPTION
Domain Adaptation. John Blitzer and Hal Daumé III. Classical "single-domain" learning, and what changes when the training (source) and test (target) distributions differ.
Tailoring Word Alignments to Syntactic Machine Translation
Domain Adaptation
John Blitzer and Hal Daumé III
(PhD, visiting researcher, postdoc — all on correspondences.)
Say: correspondences for domain adaptation; correspondences for machine translation.
Classical "Single-domain" Learning
Predict:
Running with Scissors — Title: "Horrible book, horrible." This book was horrible. I read half, suffering from a headache the entire time, and eventually i lit it on fire. 1 less copy in the world. Don't waste your money. I wish i had the time spent reading this book back. It wasted my life.
Domain Adaptation
"So the topic of ah the talk today is online learning"
Training: Source
"Everything is happening online. Even the slides are produced on-line"
Testing: Target
Domain Adaptation
Natural Language Processing: "Packed with fascinating info" / "A breeze to clean up"
Visual Object Recognition
K. Saenko et al. Transferring Visual Category Models to New Domains. 2010.

Tutorial Outline
- Domain Adaptation: Common Concepts
- Semi-supervised Adaptation
  - Learning with Shared Support
  - Learning Shared Representations
- Supervised Adaptation
  - Feature-Based Approaches
  - Parameter-Based Approaches
- Open Questions and Uncovered Algorithms

Classical vs. Adaptation Error
Classical Test Error: measured on the same distribution as training!
Adaptation Target Error: measured on a new distribution!
Common Concepts in Adaptation
- Covariate shift.
- Domain discrepancy and error: adaptation is easy when the discrepancy is low, hard when it is high.
- Single good hypothesis: one hypothesis that "understands" both source & target.
A bound on the adaptation error: minimize the total variation.

Covariate Shift with Shared Support
Assumption: target & source share support.
Reweight source instances to minimize the discrepancy.

Source Instance Reweighting
  ε_T(h) = E_{x~D_T}[ ℓ(h(x), y) ]                      (defining the error)
         = Σ_x D_T(x) ℓ(h(x), y)                        (definition of expectation)
         = Σ_x D_S(x) (D_T(x) / D_S(x)) ℓ(h(x), y)      (multiplying by 1)
         = E_{x~D_S}[ (D_T(x) / D_S(x)) ℓ(h(x), y) ]    (rearranging)
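The reweighting identity can be checked numerically on a small discrete example (a sketch; the distributions and losses are made up):

```python
import numpy as np

# Two distributions over three points, sharing support.
D_S = np.array([0.5, 0.3, 0.2])   # source
D_T = np.array([0.2, 0.3, 0.5])   # target
loss = np.array([1.0, 4.0, 9.0])  # per-point loss of some hypothesis

# Target risk, computed directly and via the reweighted source expectation.
direct = np.sum(D_T * loss)
w = D_T / D_S                     # importance weights D_T(x)/D_S(x)
reweighted = np.sum(D_S * w * loss)
```

Both quantities are identical whenever the source probability is nonzero wherever the target's is — exactly the shared-support assumption.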
Sample Selection Bias
Another way to view it: draw from the target, then select into the source.

Redefine the source distribution: draw x from the target, and select x into the source with probability P(s = 1 | x):
  D_S(x) = D_T(x) P(s = 1 | x) / P(s = 1)

Rewriting the source risk and rearranging (P(s = 1) is not dependent on x):
  D_T(x) / D_S(x) = P(s = 1) / P(s = 1 | x)
Logistic Model of Source Selection
Model P(s = 1 | x) with a logistic predictor trained on the combined training data.

Selection Bias Correction Algorithm
Input: labeled source data and unlabeled target data.
1) Label source instances as s = 1, target instances as s = 0.
2) Train a predictor for P(s = 1 | x).
3) Reweight each source instance x in proportion to 1 / P(s = 1 | x) (the constant P(s = 1) does not change the minimizer).
4) Train the target predictor on the reweighted source data.

How Bias Gets Corrected

Rates for Re-weighted Learning
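A minimal numpy sketch of the correction algorithm (synthetic Gaussian data and a plain gradient-descent logistic model, both hypothetical; the (1 − p)/p odds used here are the standard discriminative estimate of D_T(x)/D_S(x), up to a constant, for a pooled source-vs-target classifier):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, iters=2000, lr=0.1):
    """Plain gradient-descent logistic regression; returns weights incl. bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(X)
    return w

def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Synthetic covariate shift: source and target differ in their input means.
Xs = rng.normal(0.0, 1.0, size=(500, 2))       # labeled source inputs
ys = (Xs[:, 0] + Xs[:, 1] > 0).astype(float)   # source labels
Xt = rng.normal(1.0, 1.0, size=(500, 2))       # unlabeled target inputs

# Steps 1-2: label source s=1, target s=0; train the selection predictor.
w_sel = fit_logistic(np.vstack([Xs, Xt]),
                     np.concatenate([np.ones(500), np.zeros(500)]))

# Step 3: Bickel-style discriminative weights — for a pooled source-vs-target
# classifier, the odds (1 - p) / p estimate D_T(x)/D_S(x) up to a constant.
p_src = predict_proba(Xs, w_sel)
weights = (1.0 - p_src) / p_src

# Step 4: weighted logistic regression on the (reweighted) source data only.
Xb = np.hstack([Xs, np.ones((len(Xs), 1))])
w_tgt = np.zeros(3)
for _ in range(2000):
    q = 1.0 / (1.0 + np.exp(-Xb @ w_tgt))
    w_tgt -= 0.1 * Xb.T @ (weights * (q - ys)) / weights.sum()
```

Source points that look like target points (here, large coordinates) get the largest weights, so the final predictor concentrates on the region where the target lives.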
Adapted from Gretton et al.

Sample Selection Bias Summary
Two key assumptions: target & source share support, and P(y | x) is unchanged across domains (covariate shift).
Advantage: an optimal target predictor without labeled target data.
Disadvantage: the shared-support assumption often fails in practice.
Sample Selection Bias References
[1] J. Heckman. Sample Selection Bias as a Specification Error. 1979.
[2] A. Gretton et al. Covariate Shift by Kernel Mean Matching. 2008.
[3] C. Cortes et al. Sample Selection Bias Correction Theory. 2008.
[4] S. Bickel et al. Discriminative Learning Under Covariate Shift. 2009.
http://adaptationtutorial.blitzer.com/references/
Unshared Support in the Real World
Books — Running with Scissors. Title: "Horrible book, horrible." This book was horrible. I read half, suffering from a headache the entire time, and eventually i lit it on fire. 1 less copy in the world. Don't waste your money. I wish i had the time spent reading this book back. It wasted my life.
Kitchen — Avante Deep Fryer; Black. Title: "lid does not work well..." I love the way the Tefal deep fryer cooks, however, I am returning my second one due to a defective lid closure. The lid may close initially, but after a few uses it no longer stays closed. I wont be buying this one again.
Error increase: 13% → 26%

Linear Regression for Rating Prediction
Score = w · x: a weight vector (…, 0.5, 1, −1.1, 2, −0.3, 0.1, 1.5, …) over word features such as "excellent", "great", "fascinating", dotted with a document count vector (…, 3, 0, 0, 1, 0, 0, 1, …).
Coupled Subspaces
- No shared support.
- Single good linear hypothesis (stronger than …).
- Coupled representation learning.

Single Good Linear Hypothesis? Adaptation squared error:

Source \ Target | Books | Kitchen
Books           | 1.35  | 1.68
Kitchen         | 1.80  | 1.19
Both            | 1.38  | 1.23
A bound on the adaptation error: a better discrepancy than total variation? What if a single good hypothesis exists?

A Generalized Discrepancy Distance
Measure how hypotheses make mistakes: the (linear, binary) discrepancy region is where two hypotheses disagree. Discrepancy is low when every pair of hypotheses disagrees on few points, high when some pair disagrees on many.

Discrepancy vs. Total Variation
- Discrepancy: computable from finite samples; related to the hypothesis class; can be low even where total variation is high.
- Total variation: not computable in general; unrelated to the hypothesis class.
- The Bickel covariate shift algorithm heuristically minimizes both measures.

Is Discrepancy Intuitively Correct?
4 domains: Books, DVDs, Electronics, Kitchen. B & D share vocabulary (fascinating, boring); E & K share vocabulary (super easy, bad quality). Plot: target error rate vs. approximate discrepancy.
Say: the proxy H∆H-distance is correlated with the adaptation loss. In particular, the two edges of the graph which have low adaptation loss (high score) in fact have low proxy H∆H-distance.
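Because the discrepancy is computable from finite samples, it can be estimated by training a domain classifier: a common proxy scores 2·(1 − 2·err) from the classifier's held-out error err. A sketch with synthetic Gaussians and a plain logistic model (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def proxy_a_distance(Xs, Xt, iters=2000, lr=0.1):
    """Train a logistic domain classifier on half the pooled data, score it on
    the other half; proxy distance = 2 * (1 - 2 * held-out error)."""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2 :]
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb[tr] @ w))
        w -= lr * Xb[tr].T @ (p - d[tr]) / len(tr)
    err = np.mean((Xb[te] @ w > 0) != (d[te] > 0.5))
    return 2.0 * (1.0 - 2.0 * err)

# Nearby domains -> classifier near chance -> distance near 0.
near = proxy_a_distance(rng.normal(0.0, 1.0, (400, 2)),
                        rng.normal(0.1, 1.0, (400, 2)))
# Far-apart domains -> classifier near perfect -> distance near 2.
far = proxy_a_distance(rng.normal(0.0, 1.0, (400, 2)),
                       rng.normal(3.0, 1.0, (400, 2)))
```

The classifier plays the role of the hypothesis pair in the discrepancy region: if no classifier in the class can tell the domains apart, the discrepancy (for that class) is low.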
A New Adaptation Bound
For every h in H:  ε_T(h) ≤ ε_S(h) + ½ d_{H∆H}(D_S, D_T) + λ,  where λ = min_{h'∈H} [ε_S(h') + ε_T(h')].
Say: if there is some good hypothesis for both domains (λ is small), then this bound tells how well you can learn from only source labeled data.
Representations and the Bound
Linear hypothesis class: predictors of the form w · x over the original (high-dimensional, e.g. binary bag-of-words) features.
Hypothesis classes from projections P: predictors of the form w · (P x), where P maps inputs to a low-dimensional shared representation.
Goal for P: minimize the divergence between the projected source and target distributions.
Learning Representations: Pivots
Source (books): fascinating, boring, read half, couldnt put it down.
Target (kitchen): defective, sturdy, leaking, like a charm.
Pivot words (frequent in both domains): fantastic, highly recommended, waste of money, horrible.
Say: "The pivots can be chosen automatically based on labeled source & unlabeled target data."
Predicting Pivot Word Presence
"Do not buy the Shark portable steamer. The trigger mechanism is defective."
"An absolutely great purchase. . . . This blender is incredibly sturdy."
"A sturdy deep fryer"
Predict the presence of pivot words (e.g. "great") from the surrounding context, using unlabeled data from both domains.

Finding a Shared Sentiment Subspace
For each pivot — "highly recommend" generates N new features: did "highly recommend" appear? — train a linear predictor of its presence from the other features. Words like "great" and "wonderful" receive similar weights across many pivot predictors.
Let θ be a basis for the subspace of best fit to the pivot predictor weight vectors; θ captures the sentiment variance in the predictors.
Sometimes predictors capture non-sentiment information: words appear for syntactic, semantic, topical, and sentiment reasons, but we are not interested in syntax. Fitting the shared subspace denoises the predictors (for example, it ignores position).
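The subspace step can be sketched in a few lines of numpy (random synthetic data, ridge-regression pivot predictors, SVD for the basis; all sizes and names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: binary doc-term matrix; the first few columns are pivot words.
n_docs, n_feats = 300, 50
X = (rng.random((n_docs, n_feats)) < 0.2).astype(float)
pivots = [0, 1, 2, 3, 4]
nonpivot = [j for j in range(n_feats) if j not in pivots]

# 1) One linear (ridge) predictor per pivot: predict the pivot's presence
#    from the non-pivot features. Column k of W is pivot k's weight vector.
A = X[:, nonpivot]
W = np.zeros((len(nonpivot), len(pivots)))
for k, p in enumerate(pivots):
    W[:, k] = np.linalg.solve(A.T @ A + np.eye(len(nonpivot)), A.T @ X[:, p])

# 2) theta = basis for the subspace of best fit to the predictor weights
#    (top left-singular vectors of W); theta @ x is the shared representation.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
theta = U[:, :3].T            # keep 3 shared dimensions
shared = A @ theta.T          # low-dimensional features to append to x
```

In practice the shared features are concatenated with the original ones before training the sentiment classifier on source labels.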
P Projects onto the Shared Subspace
Source and target instances are mapped by P into a common low-dimensional space.
Correlating Pieces of the Bound

Projection         | Discrepancy | Source Huber Loss | Target Error
Identity           | 1.796       | 0.003             | 0.253
Random             | 0.223       | 0.254             | 0.561
Coupled Projection | 0.211       | 0.07              | 0.216

Target Accuracy: Kitchen Appliances (by source domain)

Source      | No adaptation | Adapted | In-domain
Books       | 74.5%         | 78.9%   | 87.7%
DVD         | 74.0%         | 81.4%   | 87.7%
Electronics | 84.0%         | 85.9%   | 87.7%

36% reduction in error due to adaptation.

Visualizing the projection (books & kitchen), negative vs. positive features:
books — plot_pages, predictable vs. fascinating, engaging, must_read, grisham;
kitchen — the_plastic, poorly_designed, leaking, awkward_to vs. espresso, are_perfect, years_now, a_breeze.
Representation References
[1] J. Blitzer et al. Domain Adaptation with Structural Correspondence Learning. 2006.
[2] S. Ben-David et al. Analysis of Representations for Domain Adaptation. 2007.
[3] J. Blitzer et al. Domain Adaptation for Sentiment Classification. 2008.
[4] Y. Mansour et al. Domain Adaptation: Learning Bounds and Algorithms. 2009.
http://adaptationtutorial.blitzer.com/references/

Feature-based Approaches
Cell-phone domain: "horrible" is bad; "small" is good.
Hotel domain: "horrible" is bad; "small" is bad.
Key idea: share some features ("horrible"), don't share others ("small") — and let an arbitrary learning algorithm decide which are which.

Feature Augmentation
"The phone is small" → S:the S:phone S:is S:small, plus shared copies W:the W:phone W:is W:small.
"The hotel is small" → T:the T:hotel T:is T:small, plus shared copies W:the W:hotel W:is W:small.
In feature-vector lingo: x → ⟨x, x, 0⟩ for the source domain; x → ⟨x, 0, x⟩ for the target domain.

A Kernel Perspective
  K_aug(x, z) = 2 K(x, z) if x, z are from the same domain; K(x, z) otherwise.
We have ensured a single good hypothesis & destroyed shared support.

Named Entity Recognition: /bush/ and p=/the/
[Figure: per-domain weights for these word features across General, BC-news, Conversations, Newswire, Weblogs, Usenet, and Telephone, for the classes Person, Geo-political entity, Organization, and Location.]
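The augmentation map itself is tiny to implement; a sketch over feature dicts (the prefix names are illustrative, not from any library):

```python
def augment(feats, domain):
    """Feature augmentation: every feature gets a shared copy (prefix 'W:')
    plus a domain-specific copy ('S:' for source, 'T:' for target)."""
    prefix = "S:" if domain == "source" else "T:"
    out = {}
    for f, v in feats.items():
        out["W:" + f] = v       # shared ("world") copy
        out[prefix + f] = v     # domain-specific copy
    return out

phone = augment({"the": 1, "phone": 1, "is": 1, "small": 1}, "source")
hotel = augment({"the": 1, "hotel": 1, "is": 1, "small": 1}, "target")
```

The inner product of two augmented vectors from different domains matches only on the shared W: copies, while same-domain pairs match on both copies — exactly the K_aug kernel behavior above.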
Some Experimental Results (error rates; Baseline column shows the best baseline, named in parentheses)

Task           | Dom   | SrcOnly | TgtOnly | Baseline          | Prior | Augment
ACE-NER        | bn    | 4.98    | 2.37    | 2.11 (pred)       | 2.06  | 1.98
               | bc    | 4.54    | 4.07    | 3.53 (weight)     | 3.47  | 3.47
               | nw    | 4.78    | 3.71    | 3.56 (pred)       | 3.68  | 3.39
               | wl    | 2.45    | 2.45    | 2.12 (all)        | 2.41  | 2.12
               | un    | 3.67    | 2.46    | 2.10 (linint)     | 2.03  | 1.91
               | cts   | 2.08    | 0.46    | 0.40 (all)        | 0.34  | 0.32
CoNLL          | tgt   | 2.49    | 2.95    | 1.75 (wgt/li)     | 1.89  | 1.76
PubMed         | tgt   | 12.02   | 4.15    | 3.95 (linint)     | 3.99  | 3.61
CNN            | tgt   | 10.29   | 3.82    | 3.44 (linint)     | 3.35  | 3.37
Treebank-Chunk | wsj   | 6.63    | 4.35    | 4.30 (weight)     | 4.27  | 4.11
               | swbd3 | 15.90   | 4.15    | 4.09 (linint)     | 3.60  | 3.51
               | br-cf | 5.16    | 6.27    | 4.72 (linint)     | 5.22  | 5.15
               | br-cg | 4.32    | 5.36    | 4.15 (all)        | 4.25  | 4.90
               | br-ck | 5.05    | 6.32    | 5.01 (prd/li)     | 5.27  | 5.41
               | br-cl | 5.66    | 6.60    | 5.39 (wgt/prd)    | 5.99  | 5.73
               | br-cm | 3.57    | 6.59    | 3.11 (all)        | 4.08  | 4.89
               | br-cn | 4.60    | 5.56    | 4.19 (prd/li)     | 4.48  | 4.42
               | br-cp | 4.82    | 5.62    | 4.55 (wgt/prd/li) | 4.87  | 4.78
               | br-cr | 5.78    | 9.13    | 5.15 (linint)     | 6.71  | 6.30
Treebank-brown |       | 6.35    | 5.75    | 4.72 (linint)     | 4.72  | 4.65

Some Theory
Can bound the expected target error:
The bound involves the number of source examples, the number of target examples, and the average training error.

Feature Hashing
Feature augmentation creates (K+1)·D parameters — too big if K >> 20, but very sparse!
Hash Kernels
Instead of storing the augmented vector explicitly, hash each (domain-prefixed) feature into a fixed low-dimensional vector.
How much contamination between domains? The collision terms depend on the hash vector for domain u, the weights excluding domain u, and the target (low) dimensionality.
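A sketch of the hashing trick applied to augmented features (md5 is used only as a deterministic stand-in hash; real implementations use a fast hash plus a random sign to de-bias collisions):

```python
import hashlib

def bucket(key, dim):
    # Stable hash into [0, dim); md5 keeps the sketch deterministic.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % dim

def hashed_augment(feats, domain, dim=1 << 18):
    """Augmented features in O(dim) memory regardless of the number of
    domains K: hash the shared copy and the domain copy of each feature."""
    vec = {}
    for f, v in feats.items():
        for key in ("W:" + f, domain + ":" + f):
            j = bucket(key, dim)
            vec[j] = vec.get(j, 0.0) + v   # collisions simply add
    return vec

src = hashed_augment({"the": 1, "small": 1}, "S")
tgt = hashed_augment({"the": 1, "small": 1}, "T")
```

Shared copies of the same word hash to the same bucket in both domains, so cross-domain sharing survives hashing; domain-specific copies land in (almost surely) different buckets.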
Semi-supervised Feature Augmentation
For labeled data: (y, x) → (y, ⟨x, x, 0⟩) for the source domain; (y, x) → (y, ⟨x, 0, x⟩) for the target domain.
What about unlabeled data? Encode each unlabeled x as two pseudo-examples with opposite labels:
  (x) → { (+1, ⟨0, x, −x⟩), (−1, ⟨0, x, −x⟩) }
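Writing the augmented weight vector as w = ⟨w0, ws, wt⟩, the score on ⟨0, x, −x⟩ is ws·x − wt·x, so both pseudo-labels can only be satisfied when the two per-domain hypotheses agree on x. A sketch (all vectors made up):

```python
import numpy as np

def ea_unlabeled(x):
    """Each unlabeled x becomes two pseudo-examples with opposite labels
    and features (0, x, -x)."""
    z = np.concatenate([np.zeros_like(x), x, -x])
    return [(+1, z), (-1, z)]

x = np.array([0.5, 2.0])
(_, z), _ = ea_unlabeled(x)

# Agreeing source/target blocks: score 0, no loss pressure either way.
w0, ws, wt = np.array([1., 2.]), np.array([3., -1.]), np.array([3., -1.])
agree_score = z @ np.concatenate([w0, ws, wt])

# Disagreeing blocks: nonzero score, so one of the two pseudo-labels is hurt.
disagree_score = z @ np.concatenate([w0, ws, np.array([0., 0.])])
```

This is the agreement-on-unlabeled-data mechanism in the smallest possible form.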
The encoding encourages agreement between the source and target hypotheses on unlabeled data; it is akin to multiview learning and reduces the generalization bound. The objective sums loss on positive, negative, and unlabeled examples.

Feature-based References
T. Evgeniou and M. Pontil. Regularized Multi-task Learning. 2004.
H. Daumé III. Frustratingly Easy Domain Adaptation. 2007.
K. Weinberger, A. Dasgupta, J. Langford, A. Smola, J. Attenberg. Feature Hashing for Large Scale Multitask Learning. 2009.
A. Kumar, A. Saha and H. Daumé III. Frustratingly Easy Semi-Supervised Domain Adaptation. 2010.
A Parameter-based Perspective
Instead of duplicating features, write w_s = w_0 + v_s and w_t = w_0 + v_t, and regularize v_s and v_t toward zero.

Sharing Parameters via Bayes
- Single domain: N data points; given x and w, we predict y; w is regularized toward zero.
- Two domains: train a model w_s on the source domain, regularized toward zero; train a model w_t on the target domain, regularized toward w_s.
- K domains: separate weights w_k for each domain, each regularized toward a shared w_0, with w_0 regularized toward zero.
Can derive the augment approach as an approximation to this model:
  w_0 ~ Nor(0, σ²),  w_k ~ Nor(w_0, ρ²)
Non-linearize:  w_0 ~ GP(0, K),  w_k ~ GP(w_0, K)
Cluster tasks:  w_0 ~ Nor(0, σ²),  w_k ~ DP(Nor(w_0, ρ²), α)

Not All Domains Created Equal
Would like to infer the tree structure over domains automatically; the tree structure should be good for the task. Want to simultaneously infer tree structure and parameters.
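For squared loss, "regularize toward the source weights" has a closed form; a sketch on synthetic data (all constants and names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_toward(X, y, w_prior, lam=1.0):
    """Ridge regression shrunk toward w_prior instead of zero:
    min_w ||Xw - y||^2 + lam * ||w - w_prior||^2.
    Substituting v = w - w_prior reduces it to standard ridge."""
    d = X.shape[1]
    v = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ (y - X @ w_prior))
    return w_prior + v

# Source: plenty of data; target: a handful of points from a nearby task.
w_true_s = np.array([1.0, -2.0, 0.5])
w_true_t = w_true_s + np.array([0.1, 0.0, -0.1])
Xs = rng.normal(size=(500, 3)); ys = Xs @ w_true_s + 0.1 * rng.normal(size=500)
Xt = rng.normal(size=(8, 3));   yt = Xt @ w_true_t + 0.1 * rng.normal(size=8)

w_s = ridge_toward(Xs, ys, np.zeros(3), lam=1.0)   # source, regularized to 0
w_t = ridge_toward(Xt, yt, w_s, lam=5.0)           # target, regularized to w_s
w_alone = ridge_toward(Xt, yt, np.zeros(3), lam=5.0)  # target, ignoring source
```

With only eight target points, shrinking toward the source model recovers the target weights far better than shrinking toward zero — the MAP estimate under the hierarchical prior above.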
Kingman's Coalescent
A standard model for the genealogy of a population: each organism has exactly one parent (haploid), so the genealogy is a tree. The coalescent gives a distribution over trees.

Coalescent as a Graphical Model
Efficient Inference
Construct trees in a bottom-up manner. Greedy: at each step, pick the optimal pair to merge (maximizing joint likelihood) and the time to coalesce (branch length). Infer values of internal nodes by belief propagation.

Not All Domains Created Equal: Inference by EM
E-step: compute expectations over the weights w_k — message passing on the coalescent tree, done efficiently by belief propagation.
M-step: maximize the tree structure — a greedy agglomerative clustering algorithm.

Data tree versus inferred tree.
Some experimental results
Parameter-based References
O. Chapelle and Z. Harchaoui. A Machine Learning Approach to Conjoint Analysis. NIPS, 2005.
K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian Processes from Multiple Tasks. ICML, 2005.
Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task Learning for Classification with Dirichlet Process Priors. JMLR, 2007.
H. Daumé III. Bayesian Multitask Learning with Latent Hierarchies. UAI, 2009.
Theory and Practice
Hypothesis classes from projections P: 1) minimize the divergence between the projected source and target distributions.

Open Questions
Matching theory and practice: the theory does not exactly suggest what practitioners do.
Prior knowledge: e.g. vowel structure in speech recognition [Figure: vowels /i/ and /e/ plotted by formant differences F2−F1 vs. F1−F0], and cross-lingual grammar projection.

More Semi-supervised Adaptation
Self-training and co-training:
[1] D. McClosky et al. Reranking and Self-Training for Parser Adaptation. 2006.
[2] K. Sagae & J. Tsujii. Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles. 2007.
Structured representation learning:
[3] F. Huang and A. Yates. Distributional Representations for Handling Sparsity in Supervised Sequence Labeling. 2009.
http://adaptationtutorial.blitzer.com/references/

What Is a Domain Anyway?
Time? News the day I was born vs. news today? News yesterday vs. news today?
Space? News back home vs. news in Haifa? News in Tel Aviv vs. news in Haifa?
Do my data even come with a domain specified? These questions suggest a continuous structure: a stream of data, with y and d sometimes hidden.

We're All Domains: Personalization
Can we adapt and learn across millions of domains? Share enough information to be useful, but little enough to be safe? Avoid negative transfer? Avoid DAAM (domain adaptation spam)?
Thanks!
Questions?