Learning Representations of Language for Domain Adaptation

Learning Representations of Language for Domain Adaptation. Alexander Yates and Fei (Irene) Huang, Temple University, Computer and Information Sciences.



TRANSCRIPT

Page 1: Learning Representations of Language  for Domain Adaptation

Learning Representations of Language

for Domain Adaptation

Alexander Yates

Fei (Irene) Huang

Temple University, Computer and Information Sciences

Page 2: Learning Representations of Language  for Domain Adaptation


Outline

• Representations in NLP
  – Machine Learning / Data mining perspective
  – Linguistics perspective

• Domain Adaptation

• Learning Representations

• Experiments

Page 3: Learning Representations of Language  for Domain Adaptation

A sequence-labeling task

Identify phrases that name birds and cats.

[BIRD Thrushes] build cup-shaped nests, sometimes lining them with mud.

[CAT Sylvester] was #33 on TV Guide's list of top 50 best cartoon characters, together with [BIRD Tweety Bird].

Page 4: Learning Representations of Language  for Domain Adaptation

Machine Learning

Quick formal background:

Let X be a set of all possible data points (e.g., all English sentences)

Let Z be the space of all possible predictions (e.g., all sequences of labels)

A target is a function f: X → Z that we’re trying to learn

A learning machine is an algorithm L

Input: a set of examples S = {xi} drawn from distribution D, and a label zi = f(xi) for each example.

Output: a hypothesis h: X → Z that minimizes E_{x~D} 1[h(x) ≠ f(x)]

Page 5: Learning Representations of Language  for Domain Adaptation

Representations for Learning

Most NLP systems first transform the raw data into a more convenient representation.

• A representation is a function R: X → Y, for some suitable feature space Y, like Y = R^d.

• A feature is a dimension in the feature space Y.

• Alternatively, we may use the word feature to refer to a value for one component of R(x), for some representation R and instance x.

A learning machine L takes as input a set of examples (R(xi), f(xi)) and returns h: Y → Z.


Page 6: Learning Representations of Language  for Domain Adaptation

A traditional NLP representation

Thrushes build cup-shaped nests, …

Feature                       Thrushes   build   cup-shaped   nests
Word = ‘build’                   0         1         0          0
Word = ‘nests’                   0         0         0          1
Word = ‘thrushes’                1         0         0          0
Next word is ‘build’             1         0         0          0
Next word is ‘nests’             0         0         1          0
Previous word is ‘build’         0         0         1          0
Previous word is ‘thrushes’      0         1         0          0
Is capitalized                   1         0         0          0
Ends with ‘–s’                   1         0         0          1
Ends with ‘–ed’                  0         0         1          0

Feature sets are carefully engineered for specific tasks, but usually include at least the word-based features.
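As a concrete illustration, here is a minimal sketch (not the authors' code) of how word-based and orthographic features like those above can be computed for one token; the feature names are made up for the example.

```python
# Illustrative sketch: binary word and orthographic features for one token in context.
def token_features(tokens, i):
    """Return a dict of binary features for the token at position i."""
    word = tokens[i]
    feats = {
        "word=" + word.lower(): 1,
        "is_capitalized": 1 if word[0].isupper() else 0,
        "ends_with_s": 1 if word.lower().endswith("s") else 0,
        "ends_with_ed": 1 if word.lower().endswith("ed") else 0,
    }
    if i > 0:
        feats["prev_word=" + tokens[i - 1].lower()] = 1
    if i < len(tokens) - 1:
        feats["next_word=" + tokens[i + 1].lower()] = 1
    return feats

tokens = "Thrushes build cup-shaped nests".split()
print(token_features(tokens, 0))
# {'word=thrushes': 1, 'is_capitalized': 1, 'ends_with_s': 1, 'ends_with_ed': 0, 'next_word=build': 1}
```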

Page 7: Learning Representations of Language  for Domain Adaptation

Sparsity

• Common indicators for birds: feather, beak, nest, egg, wing

• Uncommon indicators: aviary, archaeopteryx, insectivorous, warm-blooded

“The jackjaw stood irreverently on the scarecrow’s shoulder.”

• Sparsity analysis of Collins parser: (Bikel, 2004)

– bilexical statistics are available in < 1.5% of parse decisions

Page 8: Learning Representations of Language  for Domain Adaptation

Sparsity in biomedical POS tagging

• Most part-of-speech taggers are trained on newswire text (Penn Treebank)
• In a standard biomedical data set, fully 23% of words never appear in the Penn Treebank

Tagger                                      Newswire accuracy   Biomedical accuracy   Unknown word accuracy
CRF with word and orthographic features     94.0%               88.3%                 67.3%
SCL (Blitzer et al., 2006)                  -                   88.9%                 72.0%

Page 9: Learning Representations of Language  for Domain Adaptation

Polysemy

• The word “thrush” is not necessarily an indicator of a bird:
  – Thrush is the term for an overgrowth of yeast in a baby's mouth.
  – Thrush products have been a staple of hot rodders for over 40 years as these performance mufflers bring together the power and sound favored by true enthusiasts.

• “Leopard”, “jaguar”, “puma”, “tiger”, “lion”, etc. all have various meanings as cats, operating systems, sports teams, and so on.

• Word meanings depend on their contexts, and word-based features do not capture this.

Page 10: Learning Representations of Language  for Domain Adaptation

Embeddings

• Kernel trick:
  – implicitly embed data points in a higher-dimensional space

• Dimensionality reduction:
  – Embed data points in a lower-dimensional space
  – Common technique in text mining, combined with vector space models
    • PCA, LSA, SVD (Deerwester et al., 1990)
    • Self-organizing maps (Honkela, 1997)
    • Independent component analysis (Sahlgren, 2005)
    • Random indexing (Väyrynen et al., 2007)

• But existing embedding techniques ignore linguistic structure.

Page 11: Learning Representations of Language  for Domain Adaptation

A representation from linguistics

Many modern linguistic theories (GPSG, HPSG, LFG, etc.) treat language as a small set of constraints over a large number of lexical features.

But lexical entries are painstakingly crafted by hand.

thrushes
  HEAD         Noun
  SINGULAR     -
  COUNT        +
  VAL-SPR      Det[SINGULAR -]
  VAL-COMP     None
  SEM-AGENCY   +
  …            …

build
  HEAD       Verb
  VFORM      Infinite
  AUX        Minus
  VAL-SUBJ   Noun[SINGULAR -, SEM-AGENCY +]
  VAL-COMP   Noun
  …          …

Page 12: Learning Representations of Language  for Domain Adaptation


Outline

• Representations in NLP

• Domain Adaptation

• Learning Representations

• Experiments

Page 13: Learning Representations of Language  for Domain Adaptation

Domains

Definition: A domain is a subset of language that is related through genre, topic, or style.

Examples:

newswire text

science fiction novels

biomedical research literature

Page 14: Learning Representations of Language  for Domain Adaptation

Domain Dependence

Newswire Domain

… isn’t signaling (verb) a recession …

… acquiring the company, signaling (verb) to others that …

… in that list, signaling (verb) that all the company’s coal and …

Dow officials were signaling (verb) that the company …

… the S&P was signaling (verb) that the Dow could fall …

Biomedical Domain

… factor for Wnt signaling (noun), …

… for the Wnt signaling (noun) pathway via …

... in a novel signaling (noun) pathway from an extracellular guidance cue …

… in the Wnt signaling (noun) pathway, and mutation …

Page 15: Learning Representations of Language  for Domain Adaptation

Domain adaptation: a hard test for NLP

Formally, a domain is a probability distribution D over the instance set X

e.g., sentences in newswire domain ~ DNews(X)
      sentences in biomedical domain ~ DBio(X)

In domain adaptation, a learning machine is given training examples from a source domain

The hypothesis is then tested on data points drawn from a separate target domain.

Page 16: Learning Representations of Language  for Domain Adaptation

Learning theory for domain adaptation

A recently proved theorem:

The error rate of h on target domain T after being trained on source domain S depends on:

1. the error rate of h on the source domain S

2. the distance between S and T

• The claim depends on a particular notion of “distance” between probability distributions S and T

[Ben-David et al., 2009]

Page 17: Learning Representations of Language  for Domain Adaptation

Formal version
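Roughly, the bound of Ben-David et al. can be written as follows (a sketch of the standard form, not necessarily the exact statement on the original slide). Here ε_S(h) and ε_T(h) are the error rates of h on the source and target domains, and the middle term measures the divergence between the two domain distributions with respect to the hypothesis class H:

```latex
\epsilon_T(h) \;\le\; \epsilon_S(h) \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h' \in \mathcal{H}} \big[\, \epsilon_S(h') + \epsilon_T(h') \,\big]
```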

Page 18: Learning Representations of Language  for Domain Adaptation


Outline

• Representations in NLP
  – Machine Learning / Data mining perspective
  – Linguistics perspective

• Domain Adaptation

• Learning Representations

• Experiments

Page 19: Learning Representations of Language  for Domain Adaptation

Objectives for (lexical) representations

1. Usefulness: We want features that help in learning the target function.

2. Non-Sparsity: We want features that appear commonly in reasonable amounts of training data.

3. Context-dependence: We want features that somehow depend on, or take into account, the context of a word.

4. Minimal domain distance: We want features that appear approximately as often in one domain as any other.

5. Automation: We don’t want to have to manually construct the features.

Page 20: Learning Representations of Language  for Domain Adaptation

Representation learning

Thrushes build cup-shaped nests
      |   R (representation)
      v
F1    F2   F3   F4   F5   F6   F7    F8   …   Fd
0.1   -7   21   0    2    0    12.1  5    …   1
      |   h (hypothesis)
      v
BIRD  X    X    X

We learn h. Why not learn R, too?!

Page 21: Learning Representations of Language  for Domain Adaptation

1) Ngram Models for Representations

finches
ngram                   Prob
- are plain             .0001
- range from            .001
- inhabit wooded        .0001
- sing                  .0001
actually - are          .0005
true - are              .001
Darwin’s -              .0001
Galapagos -             .0001
true -                  .01

thrushes
ngram                   Prob
- eat worms             .001
- are plump             .001
- lay two               .0005
- sing                  .0005
- build cup-shaped      .0001
large - in              .002
traditional - genera    .0001
soft-plumaged -         .001

(The “-” marks the position of the target word in each context ngram.)


Page 23: Learning Representations of Language  for Domain Adaptation

1) Ngram Models for Representations

finches
feature                 value
- are plain             .0001
- range from            .001
- inhabit wooded        .0001
- sing                  .0001
actually - are          .0005
true - are              .001
Darwin’s -              .0001
Galapagos -             .0001
true -                  .01

thrushes
feature                 value
- eat worms             .001
- are plump             .001
- lay two               .0005
- sing                  .0005
- build cup-shaped      .0001
large - in              .002
traditional - genera    .0001
soft-plumaged -         .001

Page 24: Learning Representations of Language  for Domain Adaptation

1) Ngram Models for Representations

Training sentence:  True finches are predominantly seed-eating songbirds.
Labels:             X    BIRD    X    X    X    BIRD

(Each token’s feature vector now includes ngram features such as ‘- sing’, with the ngram-model probabilities as values, alongside the binary word features.)

Page 25: Learning Representations of Language  for Domain Adaptation

1) Ngram Models for Representations

Test sentence:  Thrushes build cup-shaped nests, sometimes …
Labels:         BIRD    X    X    X    X

(The same ngram features, e.g. ‘- sing’ with value .0005 for ‘thrushes’, are computed for the test sentence.)

Ngram features:
• Advantages
  – Automated
  – Useful
• Disadvantages
  – Sparse
  – Not context-dependent
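To make the ngram features concrete, here is a small, self-contained sketch of how a context-ngram feature such as ‘- sing’ could be estimated from bigram counts and attached to a word. The corpus and counts are toy values, not the statistics used in the talk.

```python
# Illustrative sketch: ngram-model probabilities as real-valued features.
from collections import Counter, defaultdict

corpus = [
    "true finches are predominantly seed-eating songbirds".split(),
    "thrushes build cup-shaped nests".split(),
    "thrushes eat worms".split(),
    "finches sing".split(),
]

unigrams = Counter()
bigrams = defaultdict(Counter)
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        unigrams[w1] += 1
        bigrams[w1][w2] += 1

def ngram_feature(word, next_word):
    """Value of the '<word> - next_word' feature: P(next_word | word)."""
    if unigrams[word] == 0:
        return 0.0
    return bigrams[word][next_word] / unigrams[word]

print(ngram_feature("finches", "sing"))   # 0.5 on this toy corpus
print(ngram_feature("thrushes", "sing"))  # 0.0: sparse, unseen context
```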

Page 26: Learning Representations of Language  for Domain Adaptation

Pause: let’s generalize the procedure

1. Train a language model on lots of (unlabeled) text (preferably from multiple domains)

2. Use the language model to annotate (labeled) training and test texts with latent information

3. Use the annotations as features in a CRF

4. Train and test CRF as usual
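A hedged sketch of the four steps as a single pipeline; the callables passed in (train_lm, annotate, train_crf, evaluate) are placeholders for whichever language-model and CRF implementations are used, not a real API.

```python
# Hedged sketch of the generalized procedure, with implementations injected as callables.
def run_pipeline(unlabeled_corpus, labeled_train, labeled_test,
                 train_lm, annotate, train_crf, evaluate):
    lm = train_lm(unlabeled_corpus)          # 1. language model from unlabeled text
    train = annotate(lm, labeled_train)      # 2. add latent-state annotations
    test = annotate(lm, labeled_test)        #    to training and test sentences
    crf = train_crf(train)                   # 3-4. latent annotations become CRF
    return evaluate(crf, test)               #      features; train and test as usual
```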

Page 27: Learning Representations of Language  for Domain Adaptation

Pause: how to improve procedure?

The main idea we’ve explored is:

cluster words into sets of related words

use the clusters as features

We can control the number of clusters, to make the features less sparse.

Page 28: Learning Representations of Language  for Domain Adaptation

2) Distributional Clustering

True finches are predominantly seed-eating songbirds.

1) Construct a Naïve Bayes model for generating trigrams
• The parent node is a latent state with K possible values
• Trigrams are generated according to P_left(word | parent), P_mid(word | parent), and P_right(word | parent)

Page 29: Learning Representations of Language  for Domain Adaptation

2) Distributional Clustering – NB

True finches are predominantly seed-eating songbirds.

2) Train the prior P(parent) and conditional distributions on a large corpus using EM, treating all trigrams as independent.

3) For each token in training and test sets, determine the best value of the latent state, and use it as a new feature.

Thrushes build cup-shaped nests, …
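A minimal sketch (not the authors' implementation) of step 3: given NB parameters already estimated with EM, pick the most likely latent state for a token from its (prev, word, next) trigram. The toy parameters below are invented for illustration.

```python
# Illustrative sketch: assigning the best NB latent state to a token.
import math

def nb_state(prev, word, next_, prior, p_left, p_mid, p_right):
    """argmax_k P(k) * P_left(prev|k) * P_mid(word|k) * P_right(next|k)."""
    best_k, best_score = None, float("-inf")
    for k, pk in enumerate(prior):
        score = (math.log(pk)
                 + math.log(p_left[k].get(prev, 1e-9))
                 + math.log(p_mid[k].get(word, 1e-9))
                 + math.log(p_right[k].get(next_, 1e-9)))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy parameters for K = 2 latent states (in the real system these come from EM
# on a large unlabeled corpus).
prior = [0.5, 0.5]
p_left = [{"thrushes": 0.4, "finches": 0.4}, {"the": 0.5}]
p_mid = [{"build": 0.3, "eat": 0.3}, {"nests": 0.4, "worms": 0.3}]
p_right = [{"cup-shaped": 0.2, "worms": 0.2}, {",": 0.3, ".": 0.3}]

print(nb_state("thrushes", "build", "cup-shaped", prior, p_left, p_mid, p_right))  # 0
```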

Page 30: Learning Representations of Language  for Domain Adaptation

2) Distributional Clustering – NB

Advantages over ngram features
1) Sparsity: only K features, so each should be common
2) Context-dependence: The new feature depends not just on the token at position i, but also on tokens at i-1 and i+1

Potential problems:
1) Features are only sensitive to immediate neighbors
2) The model requires 3 observation distributions, each of which will be sparsely observed.
3) Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?

Page 31: Learning Representations of Language  for Domain Adaptation

3) Distributional clustering - HMMs

True finches are predominantly seed-eating songbirds.

Hidden Markov Model
One latent node y_i per token x_i
A conditional observation distribution P_obs(x_i | y_i)
A conditional transition distribution P_trans(y_i | y_{i-1})
A prior distribution P_prior(y_1)

Joint probability P(x, y) given by

P(x, y) = P_prior(y_1) P_obs(x_1 | y_1) ∏_{i=2..N} P_trans(y_i | y_{i-1}) P_obs(x_i | y_i)

Page 32: Learning Representations of Language  for Domain Adaptation

3) Distributional clustering - HMMs

True finches are predominantly seed-eating songbirds.

1) Train the prior and conditional distributions on a large corpus using EM.

2) Use the Viterbi algorithm to find the best setting of all latent states for a given sentence.

3) Use the latent state value yi as a new feature for xi.

Thrushes build cup-shaped nests
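A small, self-contained sketch of steps 2 and 3: Viterbi decoding of the HMM latent states, which then become one extra feature per token. EM training is omitted, and the parameters below are toy values rather than learned ones.

```python
# Illustrative sketch: Viterbi decoding of HMM latent states as per-token features.
import numpy as np

def viterbi(obs, prior, trans, emit):
    """obs: token indices; prior: (K,); trans: (K,K); emit: (K,V). Returns best state path."""
    K, N = prior.shape[0], len(obs)
    delta = np.zeros((N, K))            # best log-prob of a path ending in state k at position i
    back = np.zeros((N, K), dtype=int)  # backpointers
    delta[0] = np.log(prior) + np.log(emit[:, obs[0]])
    for i in range(1, N):
        scores = delta[i - 1][:, None] + np.log(trans)   # (K_prev, K_cur)
        back[i] = scores.argmax(axis=0)
        delta[i] = scores.max(axis=0) + np.log(emit[:, obs[i]])
    path = [int(delta[-1].argmax())]
    for i in range(N - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

vocab = {"thrushes": 0, "build": 1, "cup-shaped": 2, "nests": 3}
prior = np.array([0.5, 0.5])
trans = np.array([[0.3, 0.7], [0.6, 0.4]])
emit = np.array([[0.4, 0.1, 0.1, 0.4],     # state 0: "noun-like" words
                 [0.1, 0.5, 0.3, 0.1]])    # state 1: "verb/modifier-like" words
obs = [vocab[w] for w in ["thrushes", "build", "cup-shaped", "nests"]]
print(viterbi(obs, prior, trans, emit))    # [0, 1, 1, 0]
```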

Page 33: Learning Representations of Language  for Domain Adaptation

3) Distributional Clustering – HMMs

Advantages over NB features
1) Sparsity: same number of features, but the HMM model itself is less sparse: it includes only one observation distribution
2) Context-dependence: The new feature depends (indirectly) on the whole observation sequence

Potential problem:
Did we throw out too much of the information in the ngram model by reducing the dimensionality too far?

Page 34: Learning Representations of Language  for Domain Adaptation

4) Multi-dimensional clustering

True finches are predominantly seed-eating songbirds.

Independent HMM (I-HMM) model:
L layers of HMM models, each trained independently.
Each layer’s parameters are initialized randomly for EM.

Page 35: Learning Representations of Language  for Domain Adaptation

4) Multi-dimensional clustering

True finches are predominantly seed-eating songbirds.

As before, we decode each layer using the Viterbi algorithm to generate features.
Each layer represents a random projection from the full feature space to K boolean dimensions.

Page 36: Learning Representations of Language  for Domain Adaptation

4) Multi-dimensional clustering

Advantages over HMM features
1) Usefulness: closer to the lexical representation from linguistics
2) Usefulness: can represent K^L points (instead of just K)

Potential problem:
Each layer is trained independently, so are they really providing additional (rather than overlapping) information?
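A brief sketch of how the per-layer outputs could be combined into multi-dimensional features, assuming each layer's Viterbi state sequence has already been computed (for instance with the decoder sketched earlier). The feature-naming scheme is just for illustration.

```python
# Illustrative sketch: concatenating L independently trained HMM layers into
# one multi-dimensional feature per token.
def ihmm_features(layer_states):
    """layer_states: list of L state sequences (one per layer), all of length N.
    Returns, for each token, a dict with one feature per layer."""
    num_layers, n_tokens = len(layer_states), len(layer_states[0])
    return [{f"hmm_layer{l}={layer_states[l][i]}": 1 for l in range(num_layers)}
            for i in range(n_tokens)]

layers = [[0, 1, 1, 0],    # layer 0 states for "Thrushes build cup-shaped nests"
          [2, 0, 3, 2]]    # layer 1 states
print(ihmm_features(layers)[0])  # {'hmm_layer0=0': 1, 'hmm_layer1=2': 1}
```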

Page 37: Learning Representations of Language  for Domain Adaptation


Outline

• Representations in NLP
  – Machine Learning / Data mining perspective
  – Linguistics perspective

• Domain Adaptation

• Learning Representations

• Experiments

Page 38: Learning Representations of Language  for Domain Adaptation

Experiments

• Part-of-speech tagging (and chunking)
  – Train on newswire text
  – Test on biomedical text
  (Huang and Yates, ACL 2009; Huang and Yates, DANLP 2010)

• Semantic role labeling
  – Train on newswire text
  – Test on fiction text
  (Huang and Yates, ACL 2010)

Page 39: Learning Representations of Language  for Domain Adaptation

Part-of-Speech (POS) tagging

Tagger                                              Biomedical accuracy   Unknown word accuracy
baseline: CRF with word and orthographic features   88.3%                 67.3%
baseline + NB features                              88.4%                 69.3%
SCL (Blitzer et al., 2006)                          88.9%                 72.0%
baseline + HMM features                             90.5%                 75.2%
baseline + Web ngram features                       93.1%                 75.6%
baseline + I-HMM features (7 layers)                93.3%                 76%
SCL + 500 labeled biomedical sentences              96.1%                 -

Except for the Web ngram features, all features derived from the Penn Treebank plus 70,000 sentences of unlabeled biomedical text.

Page 40: Learning Representations of Language  for Domain Adaptation

Sparsity

                 Sparse        Not Sparse
Num. tokens      463           12194
Baseline         52.5          89.6
Web Ngrams       61.8          94.0
NB (-Ngram)      57.8 (-4.0)   89.4 (-4.6)
HMM (-Ngram)     60.2 (-1.6)   91.6 (-2.4)
I-HMM (-Ngram)   62.9 (+1.1)   94.5 (+0.5)

(Differences in parentheses are relative to the Web Ngram row.)

Sparse: The word appears 5 times or fewer in all of our unlabeled text.
Not Sparse: The word appears 50 times or more in all of our unlabeled text.

Graphical models perform better on sparse words than not-sparse words, relative to Ngram models.

Page 41: Learning Representations of Language  for Domain Adaptation

Polysemy

                 Polysemous    Not Polysemous
Num. tokens      159           4321
Baseline         59.5          78.5
Web Ngrams       68.2          85.3
NB (-Ngram)      64.5 (-3.7)   88.7 (+3.4)
HMM (-Ngram)     67.9 (-0.3)   83.4 (-1.9)
I-HMM (-Ngram)   75.6 (+7.4)   85.2 (-0.1)

Polysemous: The word is associated with multiple, unrelated POS tags.
Not Polysemous: The word has only 1 POS tag in all of our labeled text.

Graphical models perform better on polysemous words than not-polysemous words, relative to Ngram models (except for NB).

Page 42: Learning Representations of Language  for Domain Adaptation

Accuracy vs. domain distance

• Distance is measured as the Jensen-Shannon divergence between frequencies of features in S and T.

• For I-HMMs, we weighted the distance for each layer by the proportion of CRF parameter weights placed on that layer.
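A minimal sketch of that distance computation over feature-frequency vectors; the counts below are toy values, not the experimental data.

```python
# Illustrative sketch: Jensen-Shannon divergence between the frequency of each
# feature in the source (S) and target (T) domains.
import numpy as np

def js_divergence(p, q):
    """JS divergence between two discrete distributions (base-2 logs, so in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy feature-frequency vectors over the same feature inventory in S and T.
freq_source = [500, 300, 200, 0]
freq_target = [100, 250, 400, 250]
print(js_divergence(freq_source, freq_target))  # larger = more domain distance
```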

Page 43: Learning Representations of Language  for Domain Adaptation

Biomedical NP Chunking

The I-HMM representation can reduce error by over 57% relative to a standard representation, when training on news text and testing on biomedical journal text.

Page 44: Learning Representations of Language  for Domain Adaptation

Chinese POS Tagging

Domain              Tokens   Stanford Chinese Tagger   CRF + HMM
Lore                5428     88.4                      89.7*
Religion            3248     83.5                      85.2
Humour              3326     89.0                      89.6
General fiction     4913     87.5                      89.4**
Essay               5214     88.4                      89.0
Mystery             5774     87.4                      90.1**
Romance             5489     87.5                      89.0*
Science-fiction     3070     88.6                      87.0
Skills              5464     82.7                      84.9**
Science             5262     86.0                      87.8**
Adventure fiction   5071     82.1**                    80.0
Report              6662     91.7                      91.9
News                9774     98.8**                    96.9
All but news        58921    87.0                      88.1**
All domains         68695    88.7                      89.5**

HMMs can beat a state-of-the-art system on many different domains.

Page 45: Learning Representations of Language  for Domain Adaptation

Semantic Role Labeling (SRL)

(aka, Shallow Semantic Parsing)

Input:
1) Training sentences, labeled with syntax and semantic roles
2) A new sentence, and its syntax

Output: The predicate, arguments, and their roles

Example output:

[Builder Thrushes] [Predicate build] [Thing Built cup-shaped nests]

Page 46: Learning Representations of Language  for Domain Adaptation

Parsing

Chris broke the window with a hammer
Proper Noun | Verb | Det. | Noun | Prep. | Det. | Noun

Parse: [S [NP Chris] [VP broke [NP the window] [PP with [NP a hammer]]]]

Chris = Subject; the window = Direct Object

Page 47: Learning Representations of Language  for Domain Adaptation

Semantic Role Labeling

Chris broke the window with a hammer
Proper Noun | Verb | Det. | Noun | Prep. | Det. | Noun

Same parse as above, now labeled with semantic roles:
Chris = Breaker; the window = Thing broken; with a hammer = Means

Page 48: Learning Representations of Language  for Domain Adaptation

Semantic Role Labeling

The window broke
Det. | Noun | Verb

Parse: [S [NP The window] [VP broke]]

The window = Subject = Thing broken

Page 49: Learning Representations of Language  for Domain Adaptation

Simple, open-domain SRL

Chris broke the window with a hammer

Token               Chris         broke   the    window   with    a      hammer
POS tag             Proper Noun   Verb    Det.   Noun     Prep.   Det.   Noun
Chunk tag           B-NP          B-VP    B-NP   I-NP     B-PP    B-NP   I-NP
dist. from pred.    -1            0       +1     +2       +3      +4     +5

SRL labels: Chris = Breaker; broke = Pred; the window = Thing Broken; a hammer = Means

Baseline features
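A small sketch (not the authors' code) of how these baseline per-token features might be assembled for a CRF toolkit; the feature names are illustrative.

```python
# Illustrative sketch: baseline SRL features (POS tag, chunk tag, signed
# distance from the predicate) as per-token dicts.
def srl_baseline_features(pos_tags, chunk_tags, predicate_index):
    feats = []
    for i in range(len(pos_tags)):
        feats.append({
            "pos=" + pos_tags[i]: 1,
            "chunk=" + chunk_tags[i]: 1,
            "dist_from_pred": i - predicate_index,
        })
    return feats

pos = ["Proper Noun", "Verb", "Det.", "Noun", "Prep.", "Det.", "Noun"]
chunks = ["B-NP", "B-VP", "B-NP", "I-NP", "B-PP", "B-NP", "I-NP"]
print(srl_baseline_features(pos, chunks, predicate_index=1)[0])
# {'pos=Proper Noun': 1, 'chunk=B-NP': 1, 'dist_from_pred': -1}
```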

Page 50: Learning Representations of Language  for Domain Adaptation

Simple, open-domain SRL

Chris broke the window with a hammer

Token               Chris         broke   the    window   with    a      hammer
POS tag             Proper Noun   Verb    Det.   Noun     Prep.   Det.   Noun
Chunk tag           B-NP          B-VP    B-NP   I-NP     B-PP    B-NP   I-NP
dist. from pred.    -1            0       +1     +2       +3      +4     +5
HMM label           (one HMM latent-state label per token)

SRL labels: Chris = Breaker; broke = Pred; the window = Thing Broken; a hammer = Means

Baseline + HMM features

Page 51: Learning Representations of Language  for Domain Adaptation

The importance of paths

Chris [predicate broke] [thing broken a hammer].

Chris [predicate broke] a window with [means a hammer].

Chris [predicate broke] the desk, so she fetched [not an arg a hammer] and nails.

Page 52: Learning Representations of Language  for Domain Adaptation

Simple, open-domain SRL

Chris broke the window with a hammer

Token       Chris   broke   the    window   with         a                  hammer
Word path   None    None    None   the      the-window   the-window-with    the-window-with-a

SRL labels: Chris = Breaker; broke = Pred; the window = Thing Broken; a hammer = Means

Baseline + HMM + Paths

Page 53: Learning Representations of Language  for Domain Adaptation

Simple, open-domain SRL

Chris broke the window with a hammer

Token       Chris   broke   the    window   with         a                  hammer
Word path   None    None    None   the      the-window   the-window-with    the-window-with-a
POS path    None    None    None   Det      Det-Noun     Det-Noun-Prep      Det-Noun-Prep-Det

SRL labels: Chris = Breaker; broke = Pred; the window = Thing Broken; a hammer = Means

Baseline + HMM + Paths

Page 54: Learning Representations of Language  for Domain Adaptation

Simple, open-domain SRL

Chris broke the window with a hammer

Token       Chris   broke   the    window   with         a                  hammer
Word path   None    None    None   the      the-window   the-window-with    the-window-with-a
POS path    None    None    None   Det      Det-Noun     Det-Noun-Prep      Det-Noun-Prep-Det
HMM path    None    None    None   (paths over HMM latent states, analogous to the rows above)

SRL labels: Chris = Breaker; broke = Pred; the window = Thing Broken; a hammer = Means

Baseline + HMM + Paths
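A small sketch of the path features above; the same helper works for word paths, POS paths, or HMM-state paths, and the function name is just for illustration.

```python
# Illustrative sketch: the sequence of items strictly between the predicate and
# the candidate argument word, joined into a single path feature.
def path_feature(items, predicate_index, arg_index):
    """Join the items strictly between the predicate and the argument; 'None' if empty."""
    lo, hi = sorted((predicate_index, arg_index))
    between = items[lo + 1 : hi]
    return "-".join(between) if between else "None"

tokens = ["Chris", "broke", "the", "window", "with", "a", "hammer"]
pos = ["Proper Noun", "Verb", "Det", "Noun", "Prep", "Det", "Noun"]
print(path_feature(tokens, 1, 6))  # the-window-with-a
print(path_feature(pos, 1, 6))     # Det-Noun-Prep-Det
print(path_feature(tokens, 1, 2))  # None
```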

Page 55: Learning Representations of Language  for Domain Adaptation

Experimental results – F1

All systems were trained on newswire text from the Wall Street Journal (WSJ), and tested on WSJ and fiction texts from the Brown corpus (Brown).


Page 57: Learning Representations of Language  for Domain Adaptation

Span-HMMs

Page 58: Learning Representations of Language  for Domain Adaptation

Span-HMM features

Chris broke the window with a hammer

Span-HMM for “hammer”

SRL labels: Breaker, Pred, Thing Broken, Means
Span-HMM feature


Page 60: Learning Representations of Language  for Domain Adaptation

Span-HMM features

Chris broke the window with a hammer

Span-HMM for “a”

SRL labels: Breaker, Pred, Thing Broken, Means
Span-HMM feature


Page 62: Learning Representations of Language  for Domain Adaptation

Span-HMM features

Chris broke the window with a hammer

Span-HMM feature:   None   None   None   (values for the remaining tokens come from the span-HMMs)

SRL labels: Breaker, Pred, Thing Broken, Means

Page 63: Learning Representations of Language  for Domain Adaptation

Experimental results – SRL F1

All systems were trained on newswire text from the Wall Street Journal (WSJ), and tested on WSJ and fiction texts from the Brown corpus (Brown).

Page 64: Learning Representations of Language  for Domain Adaptation

Experimental results – feature sparsity

Page 65: Learning Representations of Language  for Domain Adaptation

Benefit grows with distance from predicate

Page 66: Learning Representations of Language  for Domain Adaptation

Take-away lessons (1)

• Hand-crafted feature sets can be beaten.

– Distributional similarity (Harris, 1954) is an extremely valuable feature for many NLP applications.

– Features based on distributional similarity, derived from a large corpus, complement traditional features.

Page 67: Learning Representations of Language  for Domain Adaptation

Take-away lessons (2)

• Context-dependent features matter a lot
  – Ngram models have their advantages, but not so much as representations.
    • Features are not dependent on local context
    • Features are sparse
    • Even web-scale models are outperformed by more sophisticated models trained on small datasets
  – HMMs significantly outperform NB clustering

Page 68: Learning Representations of Language  for Domain Adaptation

Take-away lessons (3)

• The trend is for more sophisticated models to perform better than simpler models (!)
  – In contrast to the received wisdom that more data > better models (Banko and Brill, 2001)
  – The community has not yet figured out the “right way” to define and measure distributional similarity.

Page 69: Learning Representations of Language  for Domain Adaptation

Open problems and future work

• We need a mechanism for controlling for distance between domains in our feature sets.

• More sophisticated models for representations:
  – Tree-based, rather than sequential, models
  – Non-independent, multi-dimensional models

• Sophisticated models on larger corpora

Page 70: Learning Representations of Language  for Domain Adaptation

Acknowledgments

Northwestern EECS

Prof. Doug Downey

Arun Ahuja

Temple CIS

Fei (Irene) Huang

Prof. Yuhong Guo

Avirup Sil

Anjan Nepal