1
Multi-core Structural SVM Training
Kai-Wei Chang
Department of Computer Science
University of Illinois at Urbana-Champaign
Joint Work With Vivek Srikumar and Dan Roth
2
Decisions are structured in many applications: global decisions in which several local decisions play a role, but there are mutual dependencies on their outcomes. It is essential to make coherent decisions in a way that takes the interdependencies into account. E.g., part-of-speech tagging (sequential labeling):
Input: a sequence of words. Output: part-of-speech tags {NN, VBZ, ...}.
"A cat chases a mouse" => "DT NN VBZ DT NN". The assignment to a tag y_i can depend on both the input x and the neighboring tag y_{i-1}. The feature vector phi(x, y) is defined on both input and output variables: a feature can pair a word with its tag (e.g., "cat: NN") or two adjacent tags (e.g., "VBZ-VBZ").
Motivation
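Concretely, such a feature vector can be sketched as follows (not the authors' implementation; the feature-name strings are hypothetical, and a dict stands in for a sparse vector):

```python
def features(words, tags):
    """Map a (sentence, tag sequence) pair to a sparse feature vector."""
    phi = {}
    for i, (w, t) in enumerate(zip(words, tags)):
        # Emission feature: ties the i-th word to its tag.
        key = f"word={w}|tag={t}"
        phi[key] = phi.get(key, 0) + 1
        # Transition feature: ties consecutive tags together.
        if i > 0:
            key = f"prev={tags[i-1]}|tag={t}"
            phi[key] = phi.get(key, 0) + 1
    return phi

phi = features("A cat chases a mouse".split(),
               ["DT", "NN", "VBZ", "DT", "NN"])
```

Here the transition feature "prev=DT|tag=NN" fires twice (for "cat" and "mouse"), which is how the model captures correlations between neighboring output variables.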
3
Inference with General Constraint Structure [Roth&Yih’04,07]
Recognizing Entities and Relations
Dole ’s wife, Elizabeth , is a native of N.C.
[Figure: entity candidates E1-E3 and relation candidates R12, R23 with local classifier scores. Each entity scores over {other, per, loc}: (0.05, 0.85, 0.10), (0.05, 0.50, 0.45), (0.10, 0.60, 0.30); each relation scores over {irrelevant, spouse_of, born_in}: (0.05, 0.45, 0.50), (0.10, 0.05, 0.85). Global inference over these local scores, subject to constraints relating entity types and relations, yields a coherent joint assignment.]
Improvement over no inference: 2-5%
4
Structured prediction: predicting a structured output variable y based on the input variable x. The output variables form a structure: sequences, clusters, trees, or arbitrary graphs. The structure comes from interactions between the output variables, through mutual correlations and constraints.
TODAY:
How to efficiently learn models that are used to make global decisions. We focus on training a structural SVM model. Various approaches have been proposed [Joachims et al. 09, Chang and Yih 13, Lacoste-Julien et al. 13], but they are single-threaded. DEMI-DCD: a multi-threaded algorithm for training structural SVMs.
Structured Learning and Inference
5
Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions
Outline
6
Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions
Outline
7
Inference constitutes predicting the best-scoring structure:

y* = argmax_{y in Y} w^T phi(x, y)

where phi(x, y) is the feature vector on the input-output pair, w are the weight parameters (to be estimated during learning), and Y is the set of allowed structures, often specified by constraints.
Efficient inference algorithms have been proposed for some specific structures. An integer linear programming (ILP) solver can deal with general structures.
Structured Prediction: Inference
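The inference step can be sketched as follows, assuming dict-based sparse weight and feature vectors and a caller-supplied feature map; brute-force enumeration stands in for the Viterbi or ILP solver that would be used in practice:

```python
import itertools

def score(w, phi):
    """Dot product between a sparse weight vector and a sparse feature vector."""
    return sum(w.get(k, 0.0) * v for k, v in phi.items())

def predict(w, words, tagset, features):
    # Enumerate every tag sequence (the set of allowed structures) and
    # return the highest-scoring one. Exponential in sentence length, so
    # only viable for tiny output spaces.
    best, best_s = None, float("-inf")
    for y in itertools.product(tagset, repeat=len(words)):
        s = score(w, features(words, list(y)))
        if s > best_s:
            best, best_s = list(y), s
    return best
```

Passing `features` in as a parameter keeps the sketch independent of any particular feature design.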
8
Given a set of training examples {(x_i, y_i)}, solve the following optimization problem to learn w:

min_w (1/2) ||w||^2 + C sum_i L(xi_i)
s.t. w^T phi(x_i, y_i) - w^T phi(x_i, y) >= Delta(y_i, y) - xi_i, for all examples i and feasible structures y

Here w^T phi(x_i, y_i) is the score of the gold structure, w^T phi(x_i, y) is the score of a predicted structure, Delta(y_i, y) is the loss function, and xi_i is a slack variable. w^T phi(x, y) is the scoring function used in inference. We use L(xi) = xi^2 (L2-loss structural SVM).
Structural SVM
9
The dual is a quadratic program with bound constraints:

min_{alpha >= 0} (1/2) ||w(alpha)||^2 + (1/(4C)) sum_i (sum_y alpha_{i,y})^2 - sum_{i,y} Delta(y_i, y) alpha_{i,y}

where w(alpha) = sum_{i,y} alpha_{i,y} (phi(x_i, y_i) - phi(x_i, y)). Each dual variable alpha_{i,y} corresponds to a different structure y, so the number of variables can be exponentially large.
Relationship between w and alpha: for a linear model, maintain the relationship between w and alpha throughout the learning process [Hsieh et al. 08].
Dual Problem of Structural SVM
10
Maintain an active set A of dual variables: identify the alpha_{i,y} that will be non-zero at the end of the optimization process.
In a single-threaded implementation, training consists of two phases:
Select and maintain A (active set selection step). This requires solving a loss-augmented inference problem for each example; solving these inferences is usually the bottleneck.
Update the values of the alpha_{i,y} in A (learning step). This requires (approximately) solving a sub-problem.
Related to cutting-plane methods.
Active Set
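The single-threaded two-phase loop can be sketched as follows; `loss_aug_inf` (finds the most violated structure under the current model) and `update` (one dual coordinate step) are caller-supplied stand-ins, and both names are hypothetical:

```python
def train_active_set(examples, loss_aug_inf, update, rounds=5):
    """Alternate active set selection and dual updates over the active set."""
    w = {}
    active = [set() for _ in examples]
    for _ in range(rounds):
        # Phase 1: active set selection -- one loss-augmented inference
        # per example; this is usually the bottleneck.
        for i, (x, y_gold) in enumerate(examples):
            y_hat = loss_aug_inf(w, x, y_gold)
            if y_hat != y_gold:
                active[i].add(y_hat)
        # Phase 2: learning -- dual updates restricted to the active set.
        for i, (x, y_gold) in enumerate(examples):
            for y in active[i]:
                w = update(w, x, y_gold, y)
    return w
```

Because both phases run in strict alternation here, the expensive inference calls block the cheap model updates; this is exactly the coupling DEMI-DCD removes.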
11
Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions
Outline
DEMI-DCD: Decouple Model-update and Inference with Dual Coordinate Descent. Let p be the number of threads, and split the training data into p - 1 parts.
Active set selection (inference) threads: each selects and maintains the active set A_i for the examples in its part.
Learning thread: loops over all examples and updates the model w.
w and A are shared between the threads using shared memory buffers.
Overview of DEMI-DCD
[Figure: the learning thread and the active set selection threads, communicating through shared buffers.]
12
Sequentially visit each instance and update alpha. To update alpha_{i,y}, solve the following one-variable sub-problem:

d* = argmin_d D(alpha + d e_{i,y}) s.t. alpha_{i,y} + d >= 0

Then alpha_{i,y} <- alpha_{i,y} + d* and w <- w + d* (phi(x_i, y_i) - phi(x_i, y)). Shrinking heuristic: remove alpha_{i,y} from A if alpha_{i,y} = 0 and the sub-problem would leave it at zero.
Learning Thread
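In the L2-loss case the one-variable sub-problem has a closed-form projected-Newton solution. A sketch, assuming dict-based sparse vectors; `phi_delta` denotes phi(x_i, y_i) - phi(x_i, y), and `alpha_sum_i` is sum_y alpha_{i,y} for the current example:

```python
def update_one(w, alpha, key, phi_delta, loss, alpha_sum_i, C):
    """One dual coordinate step on alpha[key]; updates w and alpha in place."""
    # The coordinate-wise gradient of the dual objective uses w^T phi_delta.
    w_dot = sum(w.get(f, 0.0) * v for f, v in phi_delta.items())
    norm2 = sum(v * v for v in phi_delta.values())
    # Projected Newton step, clipped so alpha[key] stays non-negative.
    d = max(-alpha.get(key, 0.0),
            (loss - w_dot - alpha_sum_i / (2.0 * C)) / (norm2 + 1.0 / (2.0 * C)))
    alpha[key] = alpha.get(key, 0.0) + d
    # Maintain the w-alpha relationship [Hsieh et al. 08]: w += d * phi_delta.
    for f, v in phi_delta.items():
        w[f] = w.get(f, 0.0) + d * v
    return d
```

Updating w incrementally alongside alpha is what makes each coordinate step cheap: no pass over the full active set is needed.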
13
DEMI-DCD requires little synchronization. Only the learning thread can write w, and only one active set selection thread can modify each A_i. Each thread maintains a local copy of w and copies it to/from a shared buffer after a fixed number of iterations.
Synchronization
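The shared-buffer pattern can be sketched with Python threading primitives (hypothetical, not the authors' implementation): each thread works on a local copy of w and periodically exchanges it with a lock-guarded shared buffer.

```python
import threading

class SharedModel:
    """Lock-guarded buffer holding the latest model snapshot."""

    def __init__(self):
        self._w = {}
        self._lock = threading.Lock()

    def push(self, local_w):
        # Only the learning thread publishes a new model.
        with self._lock:
            self._w = dict(local_w)

    def pull(self):
        # Inference threads take a private snapshot; they never mutate it,
        # so the lock is held only for the copy.
        with self._lock:
            return dict(self._w)
```

Copy-on-push/pull means the lock is never held during inference or learning, which is why the scheme adds so little synchronization overhead.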
14
15
Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions
Outline
A Master-Slave architecture (MS-DCD): given p processors, split the data into p parts. At each iteration:
The master sends the current model w to the slave threads.
Each slave thread solves the loss-augmented inference problems associated with its data block and updates the active set.
After all slave threads finish, the master thread updates the model according to the active set.
Implemented in JLIS.
A parallel Dual Coordinate Descent Algorithm
[Figure: the master sends the current w to the slaves; each slave solves loss-augmented inference and updates A; the master then updates w based on A.]
16
Structured Perceptron [Collins 02]: at each iteration, pick an example and find the best structured output according to the current model w, then update w with a learning rate.
SP-IPM [McDonald et al. 10]:
1. Split the data into parts.
2. Train a structured perceptron on each data block in parallel.
3. Mix the models using a linear combination.
4. Repeat from Step 2, using the mixed model as the initial model.
Structured Perceptron and its Parallel Version
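The iterative parameter mixing scheme can be sketched as follows, with a caller-supplied `perceptron_epoch` (hypothetical name) standing in for one pass of structured perceptron training on a shard:

```python
from concurrent.futures import ThreadPoolExecutor

def mix(models):
    """Uniform linear combination of dict-based weight vectors."""
    out = {}
    for m in models:
        for k, v in m.items():
            out[k] = out.get(k, 0.0) + v / len(models)
    return out

def sp_ipm(shards, perceptron_epoch, rounds=3):
    w = {}
    for _ in range(rounds):
        # Each shard trains from the current mixed model in parallel
        # (each worker gets its own copy of w).
        with ThreadPoolExecutor(max_workers=len(shards)) as ex:
            models = list(ex.map(lambda s: perceptron_epoch(dict(w), s),
                                 shards))
        w = mix(models)
    return w
```

Each round is embarrassingly parallel, but the mixed model is only an average of independently trained models, which is why SP-IPM can converge to a different solution than the single-machine perceptron.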
17
18
Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions
Outline
19
Experiment Settings
POS tagging (POS-WSJ): assign a POS label to each word in a sentence. We use the standard Penn Treebank Wall Street Journal corpus with 39,832 sentences.
Entity and relation recognition (Entity-Relation): assign entity types to mentions and identify the relations among them; 5,925 training samples. Inference is solved by an ILP solver.
We compare the following methods:
DEMI-DCD: the proposed method.
MS-DCD: a master-slave style parallel implementation of DCD.
SP-IPM: parallel structured perceptron.
20
Convergence on Primal Function Value
Relative primal function value difference along training time
[Figure: POS-WSJ and Entity-Relation, log scale]
21
Test Performance
Test performance along training time
[Figure: POS-WSJ; SP-IPM converges to a different model]
22
Test Performance
Test performance along training time, Entity-Relation task
[Figure panels: Entity F1, Relation F1]
23
Moving Average of CPU Usage
[Figure: POS-WSJ and Entity-Relation]
DEMI-DCD fully utilizes CPU power.
CPU usage drops because of the synchronization.
25
Structural SVM: Inference and Learning
DEMI-DCD for Structural SVM
Related Work
Experiments
Conclusions
Outline
26
Conclusion
We proposed DEMI-DCD for training structural SVMs on multi-core machines.
The proposed method decouples the model update and inference phases of learning.
As a result, it can fully utilize all available processors to speed up learning.
Software will be available at: http://cogcomp.cs.illinois.edu/page/software
Thank you.