practical structured learning for natural language...
TRANSCRIPT
Structured Learning for Language (Slide 1)
Hal Daumé III ([email protected])
Practical Structured Learning for Natural Language Processing
Hal Daumé III, School of Computing, University of Utah
CS Colloquium, Brigham Young University, 14 Sep 2006
Slide 2
What Is Natural Language Processing?
Processing of language to solve a problem

Task                     Input                Output
Machine Translation      Arabic sentence      English sentence
Summarization            Long documents       Short documents
Information Extraction   Document             Database
Question Answering       Corpus + query       Answer
Parsing                  Sentence             Syntactic tree
Understanding            Sentence             Logical sentence

All have innate structure to varying degrees.

Corpus-based NLP: collect example input/output pairs and learn.
Slide 3
What is Machine Learning?
Automatically uncovering structure in data

Task                       Input                        Output
Classification             Vector                       Bit (or bits)
Regression                 Vector                       Real number
Dimensionality reduction   Vector                       Shorter vector
Clustering                 Vectors                      Partition
Sequence labeling          Vectors                      Labels
Reinforcement learning     State space + observations   Policy
Slide 4
NLP versus Machine Learning

[Diagram: the NLP tasks and input/output pairs from Slide 2 set beside the machine-learning problems from Slide 3, with a reinforcement-learning-inspired view (state space → policy). The correspondence is annotated with the labels: Encoding, Ad Hoc Search, Provably Good, Natural Decomposition.]
Slide 5
Entity Detection and Tracking
Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer.
As the National Portrait Gallery planned to reveal that only one of half a dozen claimed portraits of William Shakespeare can now be considered genuine, Prof Hildegard Hammerschmidt-Hummel said she could prove that there were at least four surviving portraits of the playwright.
Why? Summarization, Machine Translation, etc...
§1.3 (Example Problem: EDT)
Slide 6
Why is EDT Hard?
Long-range dependencies
Syntax, Semantics and Discourse
World Knowledge
Highly constrained outputs
Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer.
§1.3 (Example Problem: EDT)
Slide 7
How is EDT Typically Attacked?
Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer.
Sequence Labeling:
Shakespeare/PER scholarship/* ,/* ... a/* German/GPE academic/PER claimed/* ...

Binary Classification + Clustering:
{Shakespeare, playwright}   {German}   {academic, he}
§5.2 (EDT: Prior Work)
Slide 8
Learning in NLP Applications
Model
Features
Learning
Search
Slide 9
Result: Searn (= “Search + Learn”)
Computationally efficient
Strong theoretical guarantees
State-of-the-art performance
Broadly applicable
Easy to implement
§3 (Search-based Structured Prediction)
Slide 10
Talk Outline
➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work
Slide 11
Structured Prediction
Formulated as a maximization problem:
    ŷ = argmax_{y ∈ Y(x)} f(x, y; θ)

(search over Y(x); model f; features; learned parameters θ)

Search (and learning) is tractable only for very simple Y(x) and f.
§2.2 (Machine Learning: Structured Prediction)
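The intractability claim above can be made concrete: a naive decoder must enumerate every candidate output, which grows exponentially with input length. A minimal sketch under a toy linear model; the names LABELS, emission, and transition are illustrative, not from the talk:

```python
from itertools import product

# Toy label set for a tiny tagging problem (illustrative only).
LABELS = ["N", "V", "D"]

def score(x, y, emission, transition):
    # Linear score f(x, y): per-position label scores plus a score for
    # each adjacent pair of labels.
    s = sum(emission[word][tag] for word, tag in zip(x, y))
    s += sum(transition[a][b] for a, b in zip(y, y[1:]))
    return s

def brute_force_argmax(x, emission, transition):
    # Enumerates all |LABELS|^len(x) candidate outputs: exponential in
    # sentence length, which is why search is tractable only for very
    # simple Y(x) and f.
    best = max(product(LABELS, repeat=len(x)),
               key=lambda y: score(x, y, emission, transition))
    return list(best)
```

For simple factored scores like this one, dynamic programming (Viterbi) avoids the enumeration; the point of the slide is that richer Y(x) and f lose that luxury.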
Slide 12
Reduction for Structured Prediction
➢ Idea: view structured prediction in light of search

[Diagram: a left-to-right search over tag choices (N, P, D, V, R) for the sentence “I sing merrily”.]

Each step here looks like it could be represented as a weighted multiclass problem.
Can we formalize this idea?
§3.3 (Search-based Structured Prediction)
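The per-step view on this slide can be sketched as a greedy decoder that defers each tag decision to an ordinary multiclass classifier. Here `classify` stands in for any learned classifier and the feature dictionary is a hypothetical encoding:

```python
def greedy_label(words, classify):
    # Run left to right; at each position, ask a multiclass classifier
    # for the next tag, given the word, its position, and the previous
    # tag (the "state" of the search so far).
    tags = []
    for i, word in enumerate(words):
        prev = tags[-1] if tags else "<s>"
        tags.append(classify({"word": word, "pos": i, "prev_tag": prev}))
    return tags
```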
Slide 13
Error-Limiting Reductions
➢ A formal mapping between learning problems

Learning Problem “A” (what we want to solve)  ⇄  Learning Problem “B” (what we know how to solve), via F and F⁻¹

➢ F maps examples from A to examples from B
➢ F⁻¹ maps a classifier h for B to a classifier that solves A
➢ Error limiting: good performance on B implies good performance on A
[Beygelzimer et al.,ICML 2005]
§2.3 (Machine Learning: Reductions)
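The F / F⁻¹ pattern is easiest to see on the textbook one-against-all reduction (multiclass to binary). This is not a reduction from the talk, just the simplest example of the same shape; all names below are illustrative:

```python
def one_vs_all_examples(example, num_classes):
    # F: one multiclass example (x, y) becomes num_classes binary
    # examples, each asking "is the label k?"
    x, y = example
    return [((x, k), 1 if y == k else -1) for k in range(num_classes)]

def one_vs_all_classifier(binary_scorer, num_classes):
    # F^-1: a scorer for the binary problem becomes a multiclass
    # classifier that picks the highest-scoring class.
    def h(x):
        return max(range(num_classes), key=lambda k: binary_scorer((x, k)))
    return h
```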
Slide 14
Our Reduction Goal

Structured Prediction
  → Cost-sensitive Classification, via “Searn”
  → Importance-Weighted Classification, via “Weighted All Pairs” [Beygelzimer et al., ICML 2005]
  → Binary Classification, via “Costing” [Zadrozny, Langford & Abe, ICDM 2003]

➢ Error bounds get composed
➢ New techniques for binary classification lead to new techniques for structured prediction
➢ Any generalization theory for binary classification immediately applies to structured prediction
§2.3
(Mac
hine L
earn
ing: R
educ
tions
)
Slide 15
Two Easy Reductions

Costing [Zadrozny, Langford & Abe, ICDM 2003]: maps importance-weighted binary examples X×{±1}×ℝ≥0 to unweighted binary examples X×{±1} by rejection sampling. Guarantee: L_IW ≤ L_BC · E[w].

Weighted All Pairs [Beygelzimer, Dani, Hayes, Langford and Zadrozny, ICML 2005]: maps cost-sensitive examples X×ℝ≥0×⋯×ℝ≥0 to importance-weighted binary examples X×{±1}×ℝ≥0, one per pair of classes (1v2, 1v3, ..., 2v3, ..., 1vK, 2vK, ...). Guarantee: L_CS ≤ 2·L_IW.
§2.3 (Machine Learning: Reductions)
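A sketch of the Costing half, assuming the standard rejection-sampling construction from the cited paper; the function and variable names are mine:

```python
import random

def costing_resample(weighted_examples, rng=None):
    # Keep each importance-weighted example (x, y, w) with probability
    # w / w_max; the survivors form an unweighted binary dataset on
    # which any ordinary binary learner can be trained.
    rng = rng or random.Random(0)
    w_max = max(w for _, _, w in weighted_examples)
    return [(x, y) for x, y, w in weighted_examples
            if rng.random() < w / w_max]
```

High-weight examples are almost always kept and zero-weight examples never are, which is what lets a plain classifier's error rate translate into an importance-weighted loss bound.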
Slide 16
A First Attempt
AND
VR
AND
VR
AND
VR
I sing merrily
At eachcorrect state: Train classifier to choose next state
Thm (Kääriäinen): There is a 1storder binary Markov problem such that a classification error implies a Hamming loss of:
T2−
1−1−2T1
4
12≈
T2
§3.3 (Search-based Structured Prediction)
Slide 17
Reducing Structured Prediction

[Diagram: the same left-to-right search over “I sing merrily”.]

Key Assumption: an Optimal Policy for the training data (a weak assumption!)
Given: input, true output and state; Return: best successor state
§3.4.2 (Searn: Optimal Policy)
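For plain sequence labeling under Hamming loss, the optimal policy is trivial to write down, which is why the assumption is weak. A minimal sketch assuming that setting, with the state encoded as the prefix of tags produced so far:

```python
def optimal_policy(x, true_y, state):
    # `state` is the (possibly wrong) prefix of tags produced so far.
    # Under Hamming loss the best next action is simply the true label
    # at the current position, whatever mistakes precede it.
    return true_y[len(state)]
```

For harder losses (and harder search spaces, as in the summarization task later) the optimal policy needs real computation, but the interface is the same.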
Slide 18
Idea
Optimal Policy → Learned Policy (Features)
§3.4 (Searn: Training)
Slide 19
Iterative Algorithm: Searn

Set current policy = optimal policy
Repeat:
  ➢ Decode using the current policy:
      ... after a German academic claimed ...
      ...   *       *      GPE    PER    *   ...
  ➢ Generate classification problems: at each state along the decoded path, score each possible action (PER, ORG, *, ...) by its loss (e.g. L = 0.3, L = 0, L = 0.2, L = 0.25)
  ➢ Learn a new multiclass classifier
  ➢ Interpolate: cur = β·cur + (1−β)·new
§3.4.3 (Searn: Algorithm)
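The loop above can be sketched end to end. Everything here is illustrative scaffolding (`learn`, `rollout_loss`, and the state-as-prefix encoding are assumptions, not the talk's notation), with the β-interpolation realized stochastically per decision:

```python
import random

def searn_train(data, optimal_policy, learn, actions, rollout_loss,
                beta=0.1, iterations=3, rng=None):
    # `learn` is any cost-sensitive multiclass learner producing a
    # classifier new(x, state); `rollout_loss(x, y, state, a, policy)`
    # is the loss of taking action a in `state` and completing the
    # output with `policy`.
    rng = rng or random.Random(0)
    policy = optimal_policy                      # current = optimal policy
    for _ in range(iterations):
        examples = []
        for x, y in data:
            state = []
            while len(state) < len(x):           # decode with current policy
                costs = {a: rollout_loss(x, y, state, a, policy)
                         for a in actions}
                examples.append((x, tuple(state), costs))
                state.append(policy(x, y, state))
        new = learn(examples)                    # new multiclass classifier

        def interpolated(x, y, state, old=policy, new=new):
            # cur = beta*new + (1-beta)*cur, realized stochastically.
            return new(x, state) if rng.random() < beta else old(x, y, state)
        policy = interpolated
    return policy
```

With small β each iteration moves only slightly away from the optimal policy, which is what the "conservative β" condition in the next slide's theorem refers to.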
Slide 20
Theoretical Analysis

Theorem: For conservative β, after 2T³ ln T iterations, the loss of the learned policy is bounded as follows:

    L(h) ≤ L(h₀) + 2T ln T · ℓ_avg + (1 + ln T) · c_max / T

where L(h₀) is the loss of the optimal policy, ℓ_avg is the average multiclass classification loss, and c_max is the worst-case per-step loss.
§3.5 (Searn: Theoretical Analysis)
Slide 21
Why Does Searn Work?
A classifier tells us how to search
Impossible to train on full space
Want to train only where we'll wind up
Searn tells us where we'll wind up
§3.7 (Searn: Discussion and Conclusions)
Slide 22
Talk Outline
➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work
Slide 23
Proof of Concept: Sequence Labeling

[Bar charts comparing systems on three tasks:
 Spanish Named Entity Recognition (scores 94–98): SVMISO, CRF, Searn, Searn+data
 Handwriting Recognition (scores 65–95): LR, SVM, M3N, Searn, Searn+data
 Chunking+Tagging (scores 93–97): Incremental Perceptron, LaSO, CRF, Searn]
§4.4 (Sequence Labeling: Comparison)
Slide 24
Talk Outline
➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work
Slide 25
Entity Detection and Tracking
Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer.
[Diagram: joint detection and coreference as search. For the mention “he” in “... he had died of cancer”, each candidate antecedent from the preceding text yields a separate successor state.]
§5.6.1 (EDT: Joint Detection and Coreference: Search)
Slide 26
EDT Features
➢ Lexical
➢ Syntactic
➢ Semantic
➢ Gazetteer
➢ Knowledge-based
Shakespeare scholarship, lively at the best of times, saw the fur flying yesterday after a German academic claimed to have authenticated not just one but four contemporary images of the playwright and suggested, to boot, that he had died of cancer.
§5.[4-5].3 (EDT: Feature Functions)
Slide 27
EDT Results (ACE 2004 data)
[Bar chart (scores 72–80): Sys1, Sys2, Sys3, Sys4, Searn. Annotation: the previous state of the art used extra proprietary data.]

§5.6.3 (EDT: Experimental Results)
Slide 28
Feature Contributions
[Bar chart (scores 68–82) by feature group: Lex, Syn, Sem, Gaz, KB.]

§5.6.3 (EDT: Experimental Results)
Slide 29
Talk Outline
➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work
Slide 30
New Task: Summarization

“Give me information about the Falkland Islands war”

Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.

“That's too much information to read!”

The Falkland Islands war, in 1982, was fought between Britain and Argentina.

“That's perfect!”

Standard approach is sentence extraction, but that is often deemed too “coarse” to produce good, very short summaries. We wish to also drop words and phrases => document compression.
Slide 31
Structure of Search
[Diagram: sentences S1, S2, S3, ..., Sn laid out sequentially, with nodes marked as frontier nodes or summary nodes.]

➢ Lay sentences out sequentially
➢ Generate a dependency parse of each sentence
➢ Mark each root as a frontier node
➢ Repeat:
    ➢ Choose a frontier node to add to the summary
    ➢ Add all its children to the frontier
➢ Finish when we have enough words

To make search more tractable, we run an initial round of sentence extraction at 5x the target length.

[Example dependency parse: “The man ate a big sandwich”.]

Argentina was still obsessed with the Falkland Islands even in 1994, 12 years after its defeat in the 74-day war with Britain. The country's overriding foreign policy aim continued to be winning sovereignty over the islands.

The Optimal Policy is not analytically available; it is approximated with beam search.
§6.1 (Summarization: Vine-Growth Model)
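The repeat-loop above can be sketched directly. Here `choose` stands in for the learned policy and `children` for the dependency parses; both are assumptions, not the talk's implementation:

```python
def vine_growth(roots, children, choose, word_limit):
    # Frontier starts at the dependency roots; each step moves one
    # frontier node into the summary and exposes its children,
    # stopping once the word budget is filled.
    frontier, summary = list(roots), []
    while frontier and len(summary) < word_limit:
        node = choose(frontier, summary)         # policy picks a node
        frontier.remove(node)
        summary.append(node)                     # add it to the summary
        frontier.extend(children.get(node, []))  # expose its children
    return summary
```

Because a node only becomes available after its head, the summary always corresponds to a connected subtree of the parse, which is what makes the compressed output grammatical more often than word deletion would.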
Slide 32
Example Output (40 word limit)

Sentence Extraction + Compression:
Argentina and Britain announced an agreement, nearly eight years after they fought a 74-day war a populated archipelago off Argentina's coast. Argentina gets out the red carpet, official royal visitor since the end of the Falklands war in 1982.

Vine Growth:
Argentina and Britain announced to restore full ties, eight years after they fought a 74-day war over the Falkland islands. Britain invited Argentina's minister Cavallo to London in 1992 in the first official visit since the Falklands war in 1982.

Content units (weights):
6  Diplomatic ties restored
3  Falkland war was in 1982
5  Major cabinet member visits
3  Cavallo visited UK
5  Exchanges were in 1992
2  War was 74 days long
3  War between Britain and Argentina

+24
+13
§6.7 (Summarization: Error Analysis)
Slide 33
Results
[Bar chart (scores 2–4): Approx. Optimal Vine Growth, Vine Growth (Searn), Optimal Extract @100, Extract @100, Baseline.]
[Daumé III and Marcu, DUC 2005]
§6.6 (Summarization: Experimental Results)
Slide 34
Talk Outline
➢ Searn: Search-based Structured Prediction
➢ Experimental Results:
    ➢ Sequence Labeling
    ➢ Entity Detection and Tracking
    ➢ Automatic Document Summarization
➢ Discussion and Future Work
Slide 35
What Can Searn Do?
Solve structured prediction problems...
...efficiently.
...with theoretical guarantees.
...with weak assumptions on structure.
...and outperform the competition.
...with little extra code required.
§7 (Conclusions and Future Directions)
Slide 36
What Can Searn Not Do? (Yet)
Solve structured prediction problems...
...with only weak feedback.
...with hidden variables.
...with enormous branching factors.
...without predefined search spaces.
...with “combinatorial” outputs.
...with only an optimal path.
§7 (Conclusions and Future Directions)
Slide 39
Conditional Random Fields [Lafferty, McCallum, Pereira, ICML 2001]
Max-margin Markov Networks [Taskar, Guestrin, Koller, NIPS 2003]
Searn [Daumé III, Langford, Marcu, submitted to NIPS 2006]
Incremental Perceptron [Collins & Roark, EMNLP 2004]
LaSO [Daumé III, Marcu, ICML 2005]

[Diagram: a containment hierarchy (⊃, ⊇) over the assumptions each method needs: Loss-augmented Search, Normalized Search, Optimal Policies, Optimal Path, Optimally Completable, Approximately Optimal Policy, Optimal Policy, Weak Feedback. Searn's requirement, an (approximately) optimal policy, covers Most NLP Problems.]