Domain Adaptation
DESCRIPTION
Domain Adaptation. John Blitzer and Hal Daumé III. Classical "single-domain" learning, and what changes when the training (source) and test (target) distributions differ.
Tailoring Word Alignments to Syntactic Machine Translation
Domain Adaptation
John Blitzer and Hal Daumé III
(PhD, visiting researcher, postdoc — all on correspondences.)
Say: correspondences for domain adaptation; correspondences for machine translation.
Classical "Single-domain" Learning
Predict:
Running with Scissors — Title: "Horrible book, horrible." This book was horrible. I read half, suffering from a headache the entire time, and eventually i lit it on fire. 1 less copy in the world. Don't waste your money. I wish i had the time spent reading this book back. It wasted my life.
Domain Adaptation
"So the topic of ah the talk today is online learning"
Training: Source
"Everything is happening online. Even the slides are produced on-line"
Testing: Target
Domain Adaptation
Natural Language Processing: "Packed with fascinating info" / "A breeze to clean up"
Visual Object Recognition
K. Saenko et al. Transferring Visual Category Models to New Domains. 2010.

Tutorial Outline
- Domain Adaptation: Common Concepts
- Semi-supervised Adaptation
  - Learning with Shared Support
  - Learning Shared Representations
- Supervised Adaptation
  - Feature-Based Approaches
  - Parameter-Based Approaches
- Open Questions and Uncovered Algorithms

Classical vs. Adaptation Error
Classical Test Error: measured on the same distribution as training!
Adaptation Target Error: measured on a new distribution!
Common Concepts in Adaptation
- Covariate shift.
- Domain discrepancy and error: adaptation is easy when the discrepancy is low, hard when it is high.
- Single good hypothesis: one hypothesis that "understands" both source & target.
A bound on the adaptation error: minimize the total variation.

Covariate Shift with Shared Support
Assumption: target & source share support.
Reweight source instances to minimize the discrepancy.

Source Instance Reweighting
  ε_T(h) = E_{x~D_T}[ ℓ(h(x), y) ]                      (defining the error)
         = Σ_x D_T(x) ℓ(h(x), y)                        (definition of expectation)
         = Σ_x D_S(x) (D_T(x) / D_S(x)) ℓ(h(x), y)      (multiplying by 1)
         = E_{x~D_S}[ (D_T(x) / D_S(x)) ℓ(h(x), y) ]    (rearranging)
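The reweighting identity can be checked numerically on a small discrete example (a sketch; the distributions and losses are made up):

```python
import numpy as np

# Two distributions over three points, sharing support.
D_S = np.array([0.5, 0.3, 0.2])   # source
D_T = np.array([0.2, 0.3, 0.5])   # target
loss = np.array([1.0, 4.0, 9.0])  # per-point loss of some hypothesis

# Target risk, computed directly and via the reweighted source expectation.
direct = np.sum(D_T * loss)
w = D_T / D_S                     # importance weights D_T(x)/D_S(x)
reweighted = np.sum(D_S * w * loss)
```

Both quantities are identical whenever the source probability is nonzero wherever the target's is — exactly the shared-support assumption.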
Sample Selection Bias
Another way to view it: draw from the target, then select into the source.

Redefine the source distribution: draw x from the target, and select x into the source with probability P(s = 1 | x):
  D_S(x) = D_T(x) P(s = 1 | x) / P(s = 1)

Rewriting the source risk and rearranging (P(s = 1) is not dependent on x):
  D_T(x) / D_S(x) = P(s = 1) / P(s = 1 | x)
Logistic Model of Source Selection
Model P(s = 1 | x) with a logistic predictor trained on the combined training data.

Selection Bias Correction Algorithm
Input: labeled source data and unlabeled target data.
1) Label source instances as s = 1, target instances as s = 0.
2) Train a predictor for P(s = 1 | x).
3) Reweight each source instance x in proportion to 1 / P(s = 1 | x) (the constant P(s = 1) does not change the minimizer).
4) Train the target predictor on the reweighted source data.

How Bias Gets Corrected

Rates for Re-weighted Learning
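A minimal numpy sketch of the correction algorithm (synthetic Gaussian data and a plain gradient-descent logistic model, both hypothetical; the (1 − p)/p odds used here are the standard discriminative estimate of D_T(x)/D_S(x), up to a constant, for a pooled source-vs-target classifier):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, iters=2000, lr=0.1):
    """Plain gradient-descent logistic regression; returns weights incl. bias."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(X)
    return w

def predict_proba(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

# Synthetic covariate shift: source and target differ in their input means.
Xs = rng.normal(0.0, 1.0, size=(500, 2))       # labeled source inputs
ys = (Xs[:, 0] + Xs[:, 1] > 0).astype(float)   # source labels
Xt = rng.normal(1.0, 1.0, size=(500, 2))       # unlabeled target inputs

# Steps 1-2: label source s=1, target s=0; train the selection predictor.
w_sel = fit_logistic(np.vstack([Xs, Xt]),
                     np.concatenate([np.ones(500), np.zeros(500)]))

# Step 3: Bickel-style discriminative weights — for a pooled source-vs-target
# classifier, the odds (1 - p) / p estimate D_T(x)/D_S(x) up to a constant.
p_src = predict_proba(Xs, w_sel)
weights = (1.0 - p_src) / p_src

# Step 4: weighted logistic regression on the (reweighted) source data only.
Xb = np.hstack([Xs, np.ones((len(Xs), 1))])
w_tgt = np.zeros(3)
for _ in range(2000):
    q = 1.0 / (1.0 + np.exp(-Xb @ w_tgt))
    w_tgt -= 0.1 * Xb.T @ (weights * (q - ys)) / weights.sum()
```

Source points that look like target points (here, large coordinates) get the largest weights, so the final predictor concentrates on the region where the target lives.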
Adapted from Gretton et al.

Sample Selection Bias Summary
Two key assumptions: target & source share support, and P(y | x) is unchanged across domains (covariate shift).
Advantage: an optimal target predictor without labeled target data.
Disadvantage: the shared-support assumption often fails in practice.
Sample Selection Bias References
[1] J. Heckman. Sample Selection Bias as a Specification Error. 1979.
[2] A. Gretton et al. Covariate Shift by Kernel Mean Matching. 2008.
[3] C. Cortes et al. Sample Selection Bias Correction Theory. 2008.
[4] S. Bickel et al. Discriminative Learning Under Covariate Shift. 2009.
http://adaptationtutorial.blitzer.com/references/
Unshared Support in the Real World
Books — Running with Scissors. Title: "Horrible book, horrible." This book was horrible. I read half, suffering from a headache the entire time, and eventually i lit it on fire. 1 less copy in the world. Don't waste your money. I wish i had the time spent reading this book back. It wasted my life.
Kitchen — Avante Deep Fryer; Black. Title: "lid does not work well..." I love the way the Tefal deep fryer cooks, however, I am returning my second one due to a defective lid closure. The lid may close initially, but after a few uses it no longer stays closed. I wont be buying this one again.
Error increase: 13% → 26%

Linear Regression for Rating Prediction
Score = w · x: a weight vector (…, 0.5, 1, −1.1, 2, −0.3, 0.1, 1.5, …) over word features such as "excellent", "great", "fascinating", dotted with a document count vector (…, 3, 0, 0, 1, 0, 0, 1, …).
Coupled Subspaces
- No shared support.
- Single good linear hypothesis (stronger than …).
- Coupled representation learning.

Single Good Linear Hypothesis? Adaptation squared error:

Source \ Target | Books | Kitchen
Books           | 1.35  | 1.68
Kitchen         | 1.80  | 1.19
Both            | 1.38  | 1.23
A bound on the adaptation error: a better discrepancy than total variation? What if a single good hypothesis exists?

A Generalized Discrepancy Distance
Measure how hypotheses make mistakes: the (linear, binary) discrepancy region is where two hypotheses disagree. Discrepancy is low when every pair of hypotheses disagrees on few points, high when some pair disagrees on many.

Discrepancy vs. Total Variation
- Discrepancy: computable from finite samples; related to the hypothesis class; can be low even where total variation is high.
- Total variation: not computable in general; unrelated to the hypothesis class.
- The Bickel covariate shift algorithm heuristically minimizes both measures.

Is Discrepancy Intuitively Correct?
4 domains: Books, DVDs, Electronics, Kitchen. B & D share vocabulary (fascinating, boring); E & K share vocabulary (super easy, bad quality). Plot: target error rate vs. approximate discrepancy.
Say: the proxy H∆H-distance is correlated with the adaptation loss. In particular, the two edges of the graph which have low adaptation loss (high score) in fact have low proxy H∆H-distance.
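Because the discrepancy is computable from finite samples, it can be estimated by training a domain classifier: a common proxy scores 2·(1 − 2·err) from the classifier's held-out error err. A sketch with synthetic Gaussians and a plain logistic model (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def proxy_a_distance(Xs, Xt, iters=2000, lr=0.1):
    """Train a logistic domain classifier on half the pooled data, score it on
    the other half; proxy distance = 2 * (1 - 2 * held-out error)."""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2 :]
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb[tr] @ w))
        w -= lr * Xb[tr].T @ (p - d[tr]) / len(tr)
    err = np.mean((Xb[te] @ w > 0) != (d[te] > 0.5))
    return 2.0 * (1.0 - 2.0 * err)

# Nearby domains -> classifier near chance -> distance near 0.
near = proxy_a_distance(rng.normal(0.0, 1.0, (400, 2)),
                        rng.normal(0.1, 1.0, (400, 2)))
# Far-apart domains -> classifier near perfect -> distance near 2.
far = proxy_a_distance(rng.normal(0.0, 1.0, (400, 2)),
                       rng.normal(3.0, 1.0, (400, 2)))
```

The classifier plays the role of the hypothesis pair in the discrepancy region: if no classifier in the class can tell the domains apart, the discrepancy (for that class) is low.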
A New Adaptation Bound
For every h in H:  ε_T(h) ≤ ε_S(h) + ½ d_{H∆H}(D_S, D_T) + λ,  where λ = min_{h'∈H} [ε_S(h') + ε_T(h')].
Say: if there is some good hypothesis for both domains (λ is small), then this bound tells how well you can learn from only source labeled data.
Representations and the Bound
Linear hypothesis class: predictors of the form w · x over the original (high-dimensional, e.g. binary bag-of-words) features.
Hypothesis classes from projections P: predictors of the form w · (P x), where P maps inputs to a low-dimensional shared representation.
Goal for P: minimize the divergence between the projected source and target distributions.
Learning Representations: Pivots
Source (books): fascinating, boring, read half, couldnt put it down.
Target (kitchen): defective, sturdy, leaking, like a charm.
Pivot words (frequent in both domains): fantastic, highly recommended, waste of money, horrible.
Say: "The pivots can be chosen automatically based on labeled source & unlabeled target data."
Predicting Pivot Word Presence
"Do not buy the Shark portable steamer. The trigger mechanism is defective."
"An absolutely great purchase. . . . This blender is incredibly sturdy."
"A sturdy deep fryer"
Predict the presence of pivot words (e.g. "great") from the surrounding context, using unlabeled data from both domains.

Finding a Shared Sentiment Subspace
For each pivot — "highly recommend" generates N new features: did "highly recommend" appear? — train a linear predictor of its presence from the other features. Words like "great" and "wonderful" receive similar weights across many pivot predictors.
Let θ be a basis for the subspace of best fit to the pivot predictor weight vectors; θ captures the sentiment variance in the predictors.
Sometimes predictors capture non-sentiment information: words appear for syntactic, semantic, topical, and sentiment reasons, but we are not interested in syntax. Fitting the shared subspace denoises the predictors (for example, it ignores position).
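The subspace step can be sketched in a few lines of numpy (random synthetic data, ridge-regression pivot predictors, SVD for the basis; all sizes and names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: binary doc-term matrix; the first few columns are pivot words.
n_docs, n_feats = 300, 50
X = (rng.random((n_docs, n_feats)) < 0.2).astype(float)
pivots = [0, 1, 2, 3, 4]
nonpivot = [j for j in range(n_feats) if j not in pivots]

# 1) One linear (ridge) predictor per pivot: predict the pivot's presence
#    from the non-pivot features. Column k of W is pivot k's weight vector.
A = X[:, nonpivot]
W = np.zeros((len(nonpivot), len(pivots)))
for k, p in enumerate(pivots):
    W[:, k] = np.linalg.solve(A.T @ A + np.eye(len(nonpivot)), A.T @ X[:, p])

# 2) theta = basis for the subspace of best fit to the predictor weights
#    (top left-singular vectors of W); theta @ x is the shared representation.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
theta = U[:, :3].T            # keep 3 shared dimensions
shared = A @ theta.T          # low-dimensional features to append to x
```

In practice the shared features are concatenated with the original ones before training the sentiment classifier on source labels.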
P Projects onto the Shared Subspace
Source and target instances are mapped by P into a common low-dimensional space.
Correlating Pieces of the Bound

Projection         | Discrepancy | Source Huber Loss | Target Error
Identity           | 1.796       | 0.003             | 0.253
Random             | 0.223       | 0.254             | 0.561
Coupled Projection | 0.211       | 0.07              | 0.216

Target Accuracy: Kitchen Appliances (by source domain)

Source      | No adaptation | Adapted | In-domain
Books       | 74.5%         | 78.9%   | 87.7%
DVD         | 74.0%         | 81.4%   | 87.7%
Electronics | 84.0%         | 85.9%   | 87.7%

36% reduction in error due to adaptation.

Visualizing the projection (books & kitchen), negative vs. positive features:
books — plot_pages, predictable vs. fascinating, engaging, must_read, grisham;
kitchen — the_plastic, poorly_designed, leaking, awkward_to vs. espresso, are_perfect, years_now, a_breeze.
Representation References
[1] J. Blitzer et al. Domain Adaptation with Structural Correspondence Learning. 2006.
[2] S. Ben-David et al. Analysis of Representations for Domain Adaptation. 2007.
[3] J. Blitzer et al. Domain Adaptation for Sentiment Classification. 2008.
[4] Y. Mansour et al. Domain Adaptation: Learning Bounds and Algorithms. 2009.
http://adaptationtutorial.blitzer.com/references/

Feature-based Approaches
Cell-phone domain: "horrible" is bad; "small" is good.
Hotel domain: "horrible" is bad; "small" is bad.
Key idea: share some features ("horrible"), don't share others ("small") — and let an arbitrary learning algorithm decide which are which.

Feature Augmentation
"The phone is small" → S:the S:phone S:is S:small, plus shared copies W:the W:phone W:is W:small.
"The hotel is small" → T:the T:hotel T:is T:small, plus shared copies W:the W:hotel W:is W:small.
In feature-vector lingo: x → ⟨x, x, 0⟩ for the source domain; x → ⟨x, 0, x⟩ for the target domain.

A Kernel Perspective
  K_aug(x, z) = 2 K(x, z) if x, z are from the same domain; K(x, z) otherwise.
We have ensured a single good hypothesis & destroyed shared support.

Named Entity Recognition: /bush/ and p=/the/
[Figure: per-domain weights for these word features across General, BC-news, Conversations, Newswire, Weblogs, Usenet, and Telephone, for the classes Person, Geo-political entity, Organization, and Location.]
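The augmentation map itself is tiny to implement; a sketch over feature dicts (the prefix names are illustrative, not from any library):

```python
def augment(feats, domain):
    """Feature augmentation: every feature gets a shared copy (prefix 'W:')
    plus a domain-specific copy ('S:' for source, 'T:' for target)."""
    prefix = "S:" if domain == "source" else "T:"
    out = {}
    for f, v in feats.items():
        out["W:" + f] = v       # shared ("world") copy
        out[prefix + f] = v     # domain-specific copy
    return out

phone = augment({"the": 1, "phone": 1, "is": 1, "small": 1}, "source")
hotel = augment({"the": 1, "hotel": 1, "is": 1, "small": 1}, "target")
```

The inner product of two augmented vectors from different domains matches only on the shared W: copies, while same-domain pairs match on both copies — exactly the K_aug kernel behavior above.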
Some Experimental Results (error rates; Baseline column shows the best baseline, named in parentheses)

Task           | Dom   | SrcOnly | TgtOnly | Baseline          | Prior | Augment
ACE-NER        | bn    | 4.98    | 2.37    | 2.11 (pred)       | 2.06  | 1.98
               | bc    | 4.54    | 4.07    | 3.53 (weight)     | 3.47  | 3.47
               | nw    | 4.78    | 3.71    | 3.56 (pred)       | 3.68  | 3.39
               | wl    | 2.45    | 2.45    | 2.12 (all)        | 2.41  | 2.12
               | un    | 3.67    | 2.46    | 2.10 (linint)     | 2.03  | 1.91
               | cts   | 2.08    | 0.46    | 0.40 (all)        | 0.34  | 0.32
CoNLL          | tgt   | 2.49    | 2.95    | 1.75 (wgt/li)     | 1.89  | 1.76
PubMed         | tgt   | 12.02   | 4.15    | 3.95 (linint)     | 3.99  | 3.61
CNN            | tgt   | 10.29   | 3.82    | 3.44 (linint)     | 3.35  | 3.37
Treebank-Chunk | wsj   | 6.63    | 4.35    | 4.30 (weight)     | 4.27  | 4.11
               | swbd3 | 15.90   | 4.15    | 4.09 (linint)     | 3.60  | 3.51
               | br-cf | 5.16    | 6.27    | 4.72 (linint)     | 5.22  | 5.15
               | br-cg | 4.32    | 5.36    | 4.15 (all)        | 4.25  | 4.90
               | br-ck | 5.05    | 6.32    | 5.01 (prd/li)     | 5.27  | 5.41
               | br-cl | 5.66    | 6.60    | 5.39 (wgt/prd)    | 5.99  | 5.73
               | br-cm | 3.57    | 6.59    | 3.11 (all)        | 4.08  | 4.89
               | br-cn | 4.60    | 5.56    | 4.19 (prd/li)     | 4.48  | 4.42
               | br-cp | 4.82    | 5.62    | 4.55 (wgt/prd/li) | 4.87  | 4.78
               | br-cr | 5.78    | 9.13    | 5.15 (linint)     | 6.71  | 6.30
Treebank-brown |       | 6.35    | 5.75    | 4.72 (linint)     | 4.72  | 4.65

Some Theory
Can bound the expected target error:
The bound involves the number of source examples, the number of target examples, and the average training error.

Feature Hashing
Feature augmentation creates (K+1)·D parameters — too big if K >> 20, but very sparse!
Hash Kernels
Instead of storing the augmented vector explicitly, hash each (domain-prefixed) feature into a fixed low-dimensional vector.
How much contamination between domains? The collision terms depend on the hash vector for domain u, the weights excluding domain u, and the target (low) dimensionality.
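A sketch of the hashing trick applied to augmented features (md5 is used only as a deterministic stand-in hash; real implementations use a fast hash plus a random sign to de-bias collisions):

```python
import hashlib

def bucket(key, dim):
    # Stable hash into [0, dim); md5 keeps the sketch deterministic.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % dim

def hashed_augment(feats, domain, dim=1 << 18):
    """Augmented features in O(dim) memory regardless of the number of
    domains K: hash the shared copy and the domain copy of each feature."""
    vec = {}
    for f, v in feats.items():
        for key in ("W:" + f, domain + ":" + f):
            j = bucket(key, dim)
            vec[j] = vec.get(j, 0.0) + v   # collisions simply add
    return vec

src = hashed_augment({"the": 1, "small": 1}, "S")
tgt = hashed_augment({"the": 1, "small": 1}, "T")
```

Shared copies of the same word hash to the same bucket in both domains, so cross-domain sharing survives hashing; domain-specific copies land in (almost surely) different buckets.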
Semi-supervised Feature Augmentation
For labeled data: (y, x) → (y, ⟨x, x, 0⟩) for the source domain; (y, x) → (y, ⟨x, 0, x⟩) for the target domain.
What about unlabeled data? Encode each unlabeled x as two pseudo-examples with opposite labels:
  (x) → { (+1, ⟨0, x, −x⟩), (−1, ⟨0, x, −x⟩) }
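Writing the augmented weight vector as w = ⟨w0, ws, wt⟩, the score on ⟨0, x, −x⟩ is ws·x − wt·x, so both pseudo-labels can only be satisfied when the two per-domain hypotheses agree on x. A sketch (all vectors made up):

```python
import numpy as np

def ea_unlabeled(x):
    """Each unlabeled x becomes two pseudo-examples with opposite labels
    and features (0, x, -x)."""
    z = np.concatenate([np.zeros_like(x), x, -x])
    return [(+1, z), (-1, z)]

x = np.array([0.5, 2.0])
(_, z), _ = ea_unlabeled(x)

# Agreeing source/target blocks: score 0, no loss pressure either way.
w0, ws, wt = np.array([1., 2.]), np.array([3., -1.]), np.array([3., -1.])
agree_score = z @ np.concatenate([w0, ws, wt])

# Disagreeing blocks: nonzero score, so one of the two pseudo-labels is hurt.
disagree_score = z @ np.concatenate([w0, ws, np.array([0., 0.])])
```

This is the agreement-on-unlabeled-data mechanism in the smallest possible form.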
The encoding encourages agreement between the source and target hypotheses on unlabeled data; it is akin to multiview learning and reduces the generalization bound. The objective sums loss on positive, negative, and unlabeled examples.

Feature-based References
T. Evgeniou and M. Pontil. Regularized Multi-task Learning. 2004.
H. Daumé III. Frustratingly Easy Domain Adaptation. 2007.
K. Weinberger, A. Dasgupta, J. Langford, A. Smola, J. Attenberg. Feature Hashing for Large Scale Multitask Learning. 2009.
A. Kumar, A. Saha and H. Daumé III. Frustratingly Easy Semi-Supervised Domain Adaptation. 2010.
A Parameter-based Perspective
Instead of duplicating features, write w_s = w_0 + v_s and w_t = w_0 + v_t, and regularize v_s and v_t toward zero.

Sharing Parameters via Bayes
- Single domain: N data points; given x and w, we predict y; w is regularized toward zero.
- Two domains: train a model w_s on the source domain, regularized toward zero; train a model w_t on the target domain, regularized toward w_s.
- K domains: separate weights w_k for each domain, each regularized toward a shared w_0, with w_0 regularized toward zero.
Can derive the augment approach as an approximation to this model:
  w_0 ~ Nor(0, σ²),  w_k ~ Nor(w_0, ρ²)
Non-linearize:  w_0 ~ GP(0, K),  w_k ~ GP(w_0, K)
Cluster tasks:  w_0 ~ Nor(0, σ²),  w_k ~ DP(Nor(w_0, ρ²), α)

Not All Domains Created Equal
Would like to infer the tree structure over domains automatically; the tree structure should be good for the task. Want to simultaneously infer tree structure and parameters.
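For squared loss, "regularize toward the source weights" has a closed form; a sketch on synthetic data (all constants and names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_toward(X, y, w_prior, lam=1.0):
    """Ridge regression shrunk toward w_prior instead of zero:
    min_w ||Xw - y||^2 + lam * ||w - w_prior||^2.
    Substituting v = w - w_prior reduces it to standard ridge."""
    d = X.shape[1]
    v = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ (y - X @ w_prior))
    return w_prior + v

# Source: plenty of data; target: a handful of points from a nearby task.
w_true_s = np.array([1.0, -2.0, 0.5])
w_true_t = w_true_s + np.array([0.1, 0.0, -0.1])
Xs = rng.normal(size=(500, 3)); ys = Xs @ w_true_s + 0.1 * rng.normal(size=500)
Xt = rng.normal(size=(8, 3));   yt = Xt @ w_true_t + 0.1 * rng.normal(size=8)

w_s = ridge_toward(Xs, ys, np.zeros(3), lam=1.0)   # source, regularized to 0
w_t = ridge_toward(Xt, yt, w_s, lam=5.0)           # target, regularized to w_s
w_alone = ridge_toward(Xt, yt, np.zeros(3), lam=5.0)  # target, ignoring source
```

With only eight target points, shrinking toward the source model recovers the target weights far better than shrinking toward zero — the MAP estimate under the hierarchical prior above.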
Kingman's Coalescent
A standard model for the genealogy of a population: each organism has exactly one parent (haploid), so the genealogy is a tree. The coalescent gives a distribution over trees.

Coalescent as a Graphical Model
Efficient Inference
Construct trees in a bottom-up manner. Greedy: at each step, pick the optimal pair to merge (maximizing joint likelihood) and the time to coalesce (branch length). Infer values of internal nodes by belief propagation.

Not All Domains Created Equal: Inference by EM
E-step: compute expectations over the weights w_k — message passing on the coalescent tree, done efficiently by belief propagation.
M-step: maximize the tree structure — a greedy agglomerative clustering algorithm.

Data tree versus inferred tree.
Some experimental results
Parameter-based References
O. Chapelle and Z. Harchaoui. A Machine Learning Approach to Conjoint Analysis. NIPS, 2005.
K. Yu, V. Tresp, and A. Schwaighofer. Learning Gaussian Processes from Multiple Tasks. ICML, 2005.
Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task Learning for Classification with Dirichlet Process Priors. JMLR, 2007.
H. Daumé III. Bayesian Multitask Learning with Latent Hierarchies. UAI, 2009.
Theory and Practice
Hypothesis classes from projections P: 1) minimize the divergence between the projected source and target distributions.

Open Questions
Matching theory and practice: the theory does not exactly suggest what practitioners do.
Prior knowledge: e.g. vowel structure in speech recognition [Figure: vowels /i/ and /e/ plotted by formant differences F2−F1 vs. F1−F0], and cross-lingual grammar projection.

More Semi-supervised Adaptation
Self-training and co-training:
[1] D. McClosky et al. Reranking and Self-Training for Parser Adaptation. 2006.
[2] K. Sagae & J. Tsujii. Dependency Parsing and Domain Adaptation with LR Models and Parser Ensembles. 2007.
Structured representation learning:
[3] F. Huang and A. Yates. Distributional Representations for Handling Sparsity in Supervised Sequence Labeling. 2009.
http://adaptationtutorial.blitzer.com/references/

What Is a Domain Anyway?
Time? News the day I was born vs. news today? News yesterday vs. news today?
Space? News back home vs. news in Haifa? News in Tel Aviv vs. news in Haifa?
Do my data even come with a domain specified? These questions suggest a continuous structure: a stream of data, with y and d sometimes hidden.

We're All Domains: Personalization
Can we adapt and learn across millions of domains? Share enough information to be useful, but little enough to be safe? Avoid negative transfer? Avoid DAAM (domain adaptation spam)?
Thanks!
Questions?