Revisiting Dimensionality Reduction Techniques for NLP
Jagadeesh Jagarlamudi (1), Raghavendra Udupa (2)
(1) University of Maryland, College Park, Maryland, USA (2) Microsoft Research Lab India Private Limited, Bangalore, India
ABSTRACT

Many natural language processing (NLP) applications represent words and documents as vectors in a very high dimensional space. The inherently high-dimensional nature of these applications leads to sparse vectors, resulting in poor performance of downstream applications. Dimensionality reduction aims to find a lower dimensional subspace (or simply subspace) which captures the essential information required by the downstream applications. Although it received a lot of attention in the beginning, its popularity in NLP, unlike in other fields such as computer vision, has declined over time. This is partly because, traditionally, it was studied in an unsupervised fashion, and hence the learnt subspace may not be optimal for the task at hand. But recent advances in learning low-dimensional representations in the presence of input and output variables enable us to learn task-specific subspaces that are as effective as the state-of-the-art approaches. In this tutorial, we aim to demonstrate the simplicity and effectiveness of these techniques on a diverse set of NLP tasks. By the end of the tutorial, we hope the attendees will be able to decide "whether or not dimensionality reduction can help their task and if so, how?".
The tutorial begins with an introduction to dimensionality reduction and its importance to NLP. As many of the dimensionality reduction techniques discussed in the tutorial make use of Linear Algebra, we discuss some important concepts including Linear Transformations, Positive Definite Matrices, and eigenvalues and eigenvectors. Next, we look at some important dimensionality reduction techniques for data with a single view (PCA, SVD, OPCA). We then take up applications of these techniques to some important NLP problems (word-sense discrimination, POS tagging and Information Retrieval). As NLP often involves more than one language, we look at dimensionality reduction of multiview data using Canonical Correlation Analysis. We discuss some interesting applications of multiview dimensionality reduction (bilingual document projections and mining word-level translations). We also discuss some advanced topics in dimensionality reduction, such as Non-Linear and Neural techniques, and some application-inspired techniques such as Discriminative Reranking, Supervised Semantic Analysis, and Multilingual Hashing.
We do not assume attendees know anything about dimensionality reduction (though
the tutorial should be interesting even to those who know some), but we *do* assume some
basic knowledge of linear algebra.
Road Map
Introduction
NLP and Dimensionality Reduction
Mathematical Background
Data with Single View
Techniques
Applications
Advanced Topics
Data with Multiple Views
Techniques
Applications
Advanced Topics
Summary
Dimensionality Reduction:
Motivation
Many applications involve high dimensional
(and often sparse) data
High dimensional data poses several challenges
Computational
Difficulty of Interpretation
Overfitting
However, data often lies (approximately) in a
low dimensional manifold embedded in the high
dimensional space
Dimensionality Reduction:
Goal
Given high dimensional data, discover the
underlying low dimensional structure
(Figure: 2D embedding of 560-dimensional face data; He et al., Face Recognition Using Laplacianfaces)
Dimensionality Reduction:
Benefits
Computational Efficiency
K-Nearest Neighbor Search
Data Compression
Less storage; millions of data points in RAM
Data Visualization
2D and 3D Scatter Plots
Latent Structure and Semantics
Feature Extraction
Removing distracting variance from data sets
Dimensionality Reduction:
Techniques
Projective Methods
find low dimensional projections that extract
useful information from the data, by maximizing
a suitable objective function
PCA, ICA, LDA
Manifold Modeling Methods
find low dimensional subspace that best
preserves the manifold structure in the data, by
modelling the manifold structure
LLE, Isomap, Laplacian Eigenmaps
Dimensionality Reduction:
Relevance to NLP
High dimensional data in NLP
Text Documents
Context Vectors
How can Dimensionality Reduction help?
Capture the "semantic" similarity of documents
Correlate semantically related terms
Crosslingual similarity
Data Centering
Dataset: $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$
Mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Centering: $\hat{x}_i = x_i - \mu$
Centered dataset: $\hat{X} = \{\hat{x}_1, \dots, \hat{x}_n\}$
Mean after centering: $\frac{1}{n}\sum_{i=1}^{n} \hat{x}_i = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu) = \mu - \mu = 0$
Mean after linear transformation $A$: $\frac{1}{n}\sum_{i=1}^{n} A^T \hat{x}_i = A^T\left(\frac{1}{n}\sum_{i=1}^{n} \hat{x}_i\right) = 0$

Data Variance

Dataset: $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$, centered: $\frac{1}{n}\sum_{i=1}^{n} x_i = 0$
Variance: $\frac{1}{n}\sum_{i=1}^{n}\|x_i\|^2 = \operatorname{Tr}\!\left(\frac{1}{n} X X^T\right) = \operatorname{Tr}(C)$, where $C = \frac{1}{n} X X^T$ (sample covariance)
Transformed dataset: $A^T X$
Variance after transformation: $\frac{1}{n}\sum_{i=1}^{n}\|A^T x_i\|^2 = \operatorname{Tr}(A^T C A)$
Centering doesn't
change data variance
Positive Definite Matrices
Real: $M \in \mathbb{R}^{p \times q}$
Square: $p = q$
Symmetric: $M^T = M$
Positive: $x^T M x > 0$ for all $x \neq 0$
Examples:
Identity matrix
$\begin{pmatrix} 1 & 1 \\ 1 & 5 \end{pmatrix}$
$C$, $A^T C A$ (covariance matrices)
Cholesky decomposition: $M = L L^T$

Eigenvalues and Eigenvectors

$M \in \mathbb{R}^{p \times p}$, $M u = \lambda u$ where $u$ is a vector and $\lambda$ is a scalar
eigenvector $u$, eigenvalue $\lambda$
Eigensystem of Positive Definite Matrices

$M \in \mathbb{R}^{d \times d}$
Positive eigenvalues: $\lambda_i > 0$
Real valued eigenvectors: $u_i \in \mathbb{R}^d$
Orthonormal eigenvectors: $u_i^T u_j = 0$ for $i \neq j$, and $u_i^T u_i = 1$ ($U^T U = I$)
Full rank: $\operatorname{rank}(M) = d$
Eigen decomposition: $M = U \Lambda U^T$

Data Variance and Eigenvalues

Centered dataset: $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$
Data variance: $\frac{1}{n}\sum_{i=1}^{n}\|x_i\|^2 = \operatorname{Tr}(C)$
Eigen decomposition: $C = U \Lambda U^T$
Data variance: $\frac{1}{n}\sum_{i=1}^{n}\|x_i\|^2 = \operatorname{Tr}(C) = \sum_{i=1}^{d} \lambda_i$
Principal Components Analysis (PCA)
Centered dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
Goal: Find an orthonormal linear transformation
$A^T : \mathbb{R}^d \to \mathbb{R}^k$
that maximizes data variance:
$\hat{x} = A^T x$, $A^T A = I$, data variance $= \operatorname{Tr}(A^T C A)$
Mathematical formulation:
$A^* = \arg\max_{A^T A = I} \operatorname{Tr}(A^T C A)$
(the pieces: the linear transformation $A$, the orthonormal basis constraint, and the data variance objective)
PCA: Solution
Eigen decomposition of $C$:
$C = U \Lambda U^T$
$U = [u_1\, u_2\, \cdots\, u_d]$, $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$
$A^* = [u_1\, u_2\, \cdots\, u_k]$
$\hat{x} = A^{*T} x = (u_1^T x, \dots, u_k^T x)$
MATLAB function: princomp()

PCA: Solution (contd.)

Data variance after transformation:
$A^{*T} C A^* = [u_1 \cdots u_k]^T\, U \Lambda U^T\, [u_1 \cdots u_k] = \operatorname{diag}(\lambda_1, \dots, \lambda_k)$
$\operatorname{Tr}(A^{*T} C A^*) = \sum_{i=1}^{k} \lambda_i$
Contribution of the $i$-th component to data variance: $\lambda_i \big/ \sum_{j=1}^{d} \lambda_j$
PCA: Properties
PCA decorrelates the dataset:
$A^{*T} C A^* = \operatorname{diag}(\lambda_1, \dots, \lambda_k)$
PCA gives the rank-$k$ reconstruction with minimum squared error
PCA is sensitive to the scaling of the original
features
Singular Value Decomposition (SVD)
Dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
$X = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T$
$r = \operatorname{rank}(X)$
$U \in \mathbb{R}^{d \times r}$ such that $U^T U = I$ (left singular vectors)
$V \in \mathbb{R}^{n \times r}$ such that $V^T V = I$ (right singular vectors)
$\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r) \in \mathbb{R}^{r \times r}$ (singular values)
Low rank approximation:
$X_k = U \Sigma_k V^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T$, where $\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$ and $k \le d$

SVD and Data Sphering

Centered dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
$X = U \Sigma V^T$, so $X X^T = U \Sigma^2 U^T = \sum_{i=1}^{r} \sigma_i^2 u_i u_i^T$
Note that $\frac{1}{\sigma_i^2}\, u_i^T (X X^T)\, u_i = 1$
Let $U_k = [u_1 \cdots u_k]$, $V_k = [v_1 \cdots v_k]$, $\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k)$, $k \le r$
$\Sigma^{-1} U^T (X X^T)\, U \Sigma^{-1} = I$, i.e. $A^T (X X^T) A = I$ where $A = U \Sigma^{-1}$
The linear transformation $A = U \Sigma^{-1}$
decorrelates the data set
SVD and Eigen Decomposition
Dataset: $X = \{x_1, \dots, x_n\}$, $x_i \in \mathbb{R}^d$
$X = U \Sigma V^T$
$X X^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T$ (eigen decomposition)
$X^T X = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$ (eigen decomposition)
SVD and PCA:
SVD on centered $X$ is the same as PCA on $X$
Oriented Principal Components
Analysis (OPCA)
Generalization of PCA
Along with the signal covariance $C_s$, a noise covariance $C_n$ is available
When $C_n = I$ (white noise), OPCA = PCA
Seeks projections that maximize the ratio of the
variance of the projected signal to the variance
of the projected noise
Mathematical formulation:
$A^* = \arg\max_A \dfrac{\operatorname{Tr}(A^T C_s A)}{\operatorname{Tr}(A^T C_n A)}$

OPCA: Solution

Generalized eigenvalue problem:
$C_s U = C_n U \Lambda$
Equivalent eigenvalue problem: $C_n^{-1/2} C_s C_n^{-1/2} V = V \Lambda$, where $V = C_n^{1/2} U$
$U = [u_1\, u_2\, \cdots]$, $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \dots)$, $\lambda_1 \ge \lambda_2 \ge \cdots$
$A^* = [u_1\, u_2\, \cdots\, u_k]$
$\hat{x} = A^{*T} x$
MATLAB function: eig()
OPCA: Properties
Projections remain the same when the noise
and signal vectors are globally scaled with
two different scale factors
Projected data is not necessarily uncorrelated
Can be extended to multiview data [Platt et
al, EMNLP 2010]
Popular Feature Space Models
Vector Space Model
Document is represented as bag-of-words
Features: words
Feature weight: TF($w$, $d$) or some variant
Word Space Model
Word is represented in terms of its context words
Features: words (with or without position)
Feature weight: Freq($w_i$, $w_j$)
Turney and Pantel 2010
Curse of dimensionality
We have observations $x \in \mathbb{R}^d$
$d$ is usually very large
Vector Space Models:
$d$ = vocab size (number of words in a language)
Word Space Models:
$d$ = vocab size (if position is ignored)
$d$ = vocab size $\times L$, where $L$ is the window length
Curse of dimensionality
Word Sense Discrimination
Aim: Cluster contexts based on their meaning
Steps:
1. Represent a word as a point in vector space
Dimensionality Reduction
2. Represent each context as a point
3. Cluster the points using a clustering algorithm
Vector Space: Use words as the features
Feature weight is co-occurrence strength
1. Word Vectors
2. Context Vectors
3. Sense Vectors
Word Sense Discrimination:
Dimensionality Reduction

Reduce the dimensionality of word vectors:
$W = U \Sigma V^T$; keep the top left singular vectors: $W \to [u_1 \cdots u_k]$

$W$ = word-by-context co-occurrence counts, e.g.:

            legal  clothes  ...
judge         210       75  ...
robe           50      250  ...
law           240       50  ...
suit          147      157  ...
dismisses      96      152  ...

Word Sense Discrimination:
Results & Discussion

Averaged results on 20 words:

                      Accuracy
$\chi^2$, terms             76
$\chi^2$, SVD               90
Frequency, terms            81
Frequency, SVD              88

Schütze 1998
Part-of-Speech (POS) Tagging
Given a sentence, label words with their POS tags
Unsupervised Approaches
Attempt to cluster words
Align each cluster with a POS tag
Do not assume a dictionary of tags
I ate an apple .
NN VB DT NN .
Schütze 1995; Lamar et al 2010
Part-of-Speech Tagging
Steps
1. Represent words in appropriate vector space
Dimensionality Reduction
2. Cluster using your favorite algorithm
Vector space should capture syntactic properties
Use the most frequent $d$ words as features
Frequency of a word in the context as the feature weight

Part-of-Speech Tagging:
Pass 1

Construct left and right context matrices
$L$ and $R$: matrices of size $|V| \times d$
Dimensionality Reduction:
Get a rank-$r_1$ approximation:
$L = U \Sigma V^T$, $\hat{L} = U_{r_1} \Sigma_{r_1}$, $L^* \leftarrow$ row-normalized $\hat{L}$
$R = U \Sigma V^T$, $\hat{R} = U_{r_1} \Sigma_{r_1}$, $R^* \leftarrow$ row-normalized $\hat{R}$
$D = [L^*\; R^*]$ is a $|V| \times 2r_1$ matrix
Run weighted $k$-means on $D$ with $k_1$ clusters
Part-of-Speech Tagging :
Pass 2
The clusters are not optimal because of sparsity
Construct $L$ and $R$ of size $|V| \times k_1$
Dimensionality Reduction:
Get a rank-$r_2$ approximation:
$L = U \Sigma V^T$, $\hat{L} = U_{r_2} \Sigma_{r_2}$, $L^* \leftarrow$ row-normalized $\hat{L}$
$R = U \Sigma V^T$, $\hat{R} = U_{r_2} \Sigma_{r_2}$, $R^* \leftarrow$ row-normalized $\hat{R}$
$D = [L^*\; R^*]$ is a $|V| \times 2r_2$ matrix
Run weighted $k$-means on $D$
Part-of-Speech Tagging :
Results
Penn Treebank (1.1 M tokens, 43K types)
17 and 45 tags
M-to-1 accuracies:

                     PTB17        PTB45
SVD2                 0.730        0.660
HMM-EM               0.647        0.621
HMM-VB               0.637        0.605
HMM-GS               0.674        0.660
HMM-Sparse(32)       0.702 (2.2)  0.654 (1.0)
VEM $(10^{-1}, 10^{-1})$  0.682 (0.8)  0.546 (1.7)
Lamar et al 2010
Part-of-Speech Tagging :
Discussion
Sensitivity to parameters
Scaling with singular values
$k$-means algorithm
Weighted $k$-means
Clusters are initialized to most frequent word types
Non-disambiguating tagger
Very simple algorithm
Information Retrieval
Rank documents $d$ in response to a query $q$
Vector Space Model
Query and doc. are represented as bag-of-words
Features: words; Feature weight: TFIDF
Lexical Gap
Polysemy and Synonymy
Information Retrieval :
Lexical Gap
Term $\times$ Document matrix $C$ (each row lists a term's occurrences across the documents):
ship 1 1
boat 1
ocean 1 1
voyage 1 1 1
trip 1 1
Is TFIDF weighting better?
Information Retrieval :
Latent Semantic Analysis
Term $\times$ Document matrix $C \in \mathbb{R}^{|V| \times n}$
Steps:
1. Dimensionality reduction of the term $\times$ document matrix
2. Folding-in queries: $q \to f(q)$
3. Compute semantic similarity: score($q$, $d$)
Information Retrieval :
Latent Semantic Analysis
Term $\times$ Document matrix $C \in \mathbb{R}^{|V| \times n}$, $C = U \Sigma V^T$
Steps:
1. Dimensionality Reduction
2. Folding-in queries:
$d = U \Sigma \hat{d} \;\Rightarrow\; \hat{d} = \Sigma^{-1} U^T d$, and likewise $\hat{q} = \Sigma^{-1} U^T q$
Information Retrieval :
Latent Semantic Analysis
Term $\times$ Document matrix $C = U \Sigma V^T$
Steps:
1. Dimensionality Reduction
2. Folding-in queries: $\hat{q} = \Sigma^{-1} U^T q$
3. Semantic similarity:
$\operatorname{score}(q, d) \triangleq \cos(\hat{q}, \hat{d}) = \dfrac{\langle \hat{q}, \hat{d} \rangle}{\|\hat{q}\|\,\|\hat{d}\|}$
($\langle \cdot, \cdot \rangle$ denotes the dot product)
Deerwester 1988; Dumais 2005
Information Retrieval :
Lexical Gap Revisited
Term $\times$ Document matrix $C$, and the new document representations:

Original counts:
ship 1 1
boat 1
ocean 1 1
voyage 1 1 1
trip 1 1

After dimensionality reduction (one value per document):
ship -1.62 -0.60 -0.44 -0.97 -0.70 -0.26
boat -0.46 -0.84 -0.30 1.00 0.35 0.65
Information Retrieval :
Results & Discussion
Term $\times$ Document matrix $C$
Fold in new documents as well; this deviates from the optimal as we add more docs.
Probabilistic Latent Semantic Analysis
MED CRAN CACM CISI
Cos+tfidf 49 35.2 21.9 20.2
LSA 64.6 38.7 23.8 21.9
PLSI-U 69.5 38.9 25.3 23.3
PLSI-Q 63.2 38.6 26.6 23.1
Hofmann 1999
Non-linear Dimensionality Reduction
Laplacian Eigenmaps
Weight matrix $W$ with similarities
Local neighbourhood
$D_{ii} = \sum_j W_{ij}$ and $L = D - W$
$\min_U \operatorname{Tr}(U^T L U)$ s.t. $U^T D U = I$; solutions of $L u = \lambda D u$
$u^T L u = \tfrac{1}{2} \sum_{i,j} W_{ij} (u_i - u_j)^2$
Neural Embeddings
A word is represented as a vector of size $m$
Learning
Optimize such that log-likelihood is maximized
Gradient ascent
Learns parameters and word vectors simultaneously
Learned word-vectors capture semantics
Learn to perform multiple tasks simultaneously
Bengio et al 2003; Collobert and Weston 2008
Canonical Correlation Analysis (CCA)
Centered datasets:
$X = \{x_1, \dots, x_n\} \subset \mathbb{R}^{d_1}$
$Y = \{y_1, \dots, y_n\} \subset \mathbb{R}^{d_2}$
Project $X$ and $Y$ along $a \in \mathbb{R}^{d_1}$ and $b \in \mathbb{R}^{d_2}$:
$s = (a^T x_1, \dots, a^T x_n)$, $t = (b^T y_1, \dots, b^T y_n)$
Data correlation after transformation:
$\cos(s, t) = \dfrac{s^T t}{\|s\|\,\|t\|} = \dfrac{a^T X Y^T b}{\sqrt{a^T X X^T a}\,\sqrt{b^T Y Y^T b}}$
CCA (contd.)
Covariance matrices:
$C_{xy} = X Y^T$, $C_{xx} = X X^T$, $C_{yy} = Y Y^T$
Correlation in terms of covariance matrices:
$\cos(s, t) = \dfrac{a^T C_{xy} b}{\sqrt{a^T C_{xx} a}\,\sqrt{b^T C_{yy} b}}$
Directions that maximize data correlation:
$(a^*, b^*) = \arg\max_{a, b} \dfrac{a^T C_{xy} b}{\sqrt{a^T C_{xx} a}\,\sqrt{b^T C_{yy} b}}$

CCA: Formulation

Goal: Find linear transformations $A$, $B$ that
maximize data correlation
Optimization problem:
$(A^*, B^*) = \arg\max_{A, B} \operatorname{Tr}(A^T X Y^T B)$ s.t. $A^T X X^T A = I$, $B^T Y Y^T B = I$
CCA: Solution
Generalized eigenvalue problem:
$C_{xy} B = C_{xx} A \Lambda_a$, $C_{xy}^T A = C_{yy} B \Lambda_b$
It can be shown that $\Lambda_a = \Lambda_b = \Lambda$ and $B = C_{yy}^{-1} C_{xy}^T A \Lambda^{-1}$
$C_{xy} C_{yy}^{-1} C_{xy}^T A = C_{xx} A \Lambda^2$
MATLAB function: canoncorr()
Bilingual Document Projections
Applications:
Comparable and Parallel Document Retrieval
Cross-language text categorization
Steps:
1. Represent each document as a vector
Two different vector spaces, one per language
2. Use CCA to find linear transformations ($A$, $B$)
3. Find new aligned documents using $A$ and $B$
Bilingual Document Projections
Steps:
1. Represent each document as a vector
Vector Space:
Features: Most frequent 20K content words
Feature weight: TFIDF weighting
Training Data:
$x_i \in \mathbb{R}^{d_E}$: bag of English words
$y_i \in \mathbb{R}^{d_H}$: bag of Hindi words
$(x_i, y_i)$, $i = 1, \dots, n$; $X = [x_1\, x_2\, \cdots\, x_n]$, $Y = [y_1\, y_2\, \cdots\, y_n]$
Bilingual Document Projections
Steps:
1. Represent each document as a vector
2. Use CCA to find linear transformations $A$ and $B$
3. Find new aligned documents using $A$ and $B$
Scoring:
$\operatorname{Score}(x, y) \triangleq \langle A^T x,\; B^T y \rangle$
Bilingual Document Projections :
Results & Discussion
Accuracy MRR
OPCA 72.55 77.34
Word-by-word 70.33 74.67
CCA 68.94 73.78
Word-by-word (5000) 67.86 72.36
CL-LSI 53.02 61.30
Untranslated 46.92 53.83
CPLSA 45.79 51.30
JPLSA 33.22 36.19
Platt et al 2010
Mining Word-Level Translations
Training Data:Word level seed translations
Task: Mine translations for new words (translation of "…"?)
Resources: monolingual comparable corpora
English Spanish P(s|e)
state estado 0.5
state declarar 0.3
society sociedad 0.4
society compañía 0.35
company sociedad 0.8
Mining Word-Level Translations
Applications:
Lexicon induction for resource poor languages
Mining translations for unknown words in MT
Steps:
1. Mining of "…"
2. Represent each word as vector
Two different feature spaces, one per language
3. Use CCA to find transformations $A$ and $B$
4. Use $A$ and $B$ to mine new word translations
Mining Word-Level Translations
Steps:
1. Mining of "…"
2. Represent each word as a vector
Vector Space
Features: context words (WSM); Orthography
Feature Weights: TFIDF weights
Can be computed using ONLY comparable corpora
$(x_i, y_i)$, $i = 1, \dots, n$; $X = [x_1\, x_2\, \cdots\, x_n]$, $Y = [y_1\, y_2\, \cdots\, y_n]$
Mining Word-Level Translations
Steps:
1. Mining of "…"
2. Represent each word as a vector
3. Use CCA to find transformations $A$ and $B$
4. Use $A$ and $B$ to mine new word translations
Scoring:
$\operatorname{Score}(e, s) \triangleq \langle A^T x_e,\; B^T y_s \rangle$
Mining Word-Level Translations :
Results & Discussion
Seed lexicon size 100
Bootstrapping
Results are lower for other language pairs
Best-F1 (each column is one experimental setting):

EditDist   58.6  62.6  61.1  47.4
Ortho      76.0  81.3  80.1  52.3  55.0
Context    91.1  81.3  80.2  65.3  58.0
Both       87.2  89.7  89.0  89.7  72.0
Haghighi et al 2008
Mining Word-Level Translations :
Results & Discussion
Mining translations for unknown words
OOV words for MT domain adaptation
MT accuracies (BLEU):

German        News   Emea   Subs   PHP
Baseline     23.00  26.62  10.26  38.67
+ve change   +0.80  +1.44  +0.13  +0.28

French        News   Emea   Subs   PHP
Baseline     27.30  40.46  16.91  28.12
+ve change   +0.36  +1.51  +0.61  +0.68

Daumé and Jagarlamudi 2011
Supervised Semantic Indexing
Task: Learn to rank ads $a$ for a given doc. $d$
Training Data:
Pairs $(d, a^+)$: webpages and clicked ads
Randomly chosen pairs $(d, a^-)$
Steps :
1. Represent an ad $a$ and a doc. $d$ as vectors
2. Learn a scoring function $f(a, d)$
3. Rank ads for a given document
Bai et al 2009
Supervised Semantic Indexing
Steps :
1. Represent ads and docs. as vectors
Vector Space
Bag-of-word representation
Features: words
Feature weights: TFIDF weight
$a$ and $d$ are vectors of size $|V|$
Supervised Semantic Indexing
Steps :
1. Represent ads and docs. as vectors
2. Learn a scoring function $f(a, d)$
Scoring function:
$f(a, d) = d^T W a$; parameters: $W \in \mathbb{R}^{|V| \times |V|}$
$W = I$: cosine similarity
$W = D$ (diagonal): reweighting of words
$W = U^T V + I$: dimensionality reduction; different treatment for ads and documents
$W = U^T U + I$: dimensionality reduction; SAME treatment for ads and documents

Supervised Semantic Indexing:
Learn Scoring Function

Max-margin constraint: $f(d, a^+) - f(d, a^-) \ge 1$
Objective: $\sum \max(0,\; 1 - f(d, a^+) + f(d, a^-))$
Optimization: sub-gradient descent
Supervised Semantic Indexing
Steps:
1. Represent ads and docs. as vectors
2. Learn a scoring function $f(a, d)$
3. Rank ads for a given document
Ranking Ads
Compute the score using $f(a, d)$ and rank
Supervised Semantic Indexing :
Results & Discussion
1.9 M pairs for training
100K pairs for testing
Parameters                        Rank Loss
TFIDF                                 45.60
SSI: $W = U^T V + I$, 50 x 10k        25.83
SSI: $W = U^T V + I$, 50 x 20k        26.68
SSI: $W = U^T V + I$, 50 x 30k        26.98
Bai et al 2009
Supervised Semantic Indexing :
Results & Discussion
Ranking Wikipedia pages for queries
Rank Loss (performs better when training data is big):

                                   K=5    K=10   K=20
TFIDF                              21.6   14.0   9.14
$\alpha$ LSI + $(1-\alpha)$ TFIDF  14.2   9.73   6.36
SSI: $W = U^T U + I$               4.80   3.10   1.87
SSI: $W = U^T V + I$               4.37   2.91   1.80
Bai et al 2009
Discriminative Reranking
Reranker operates in the outer product space [Szedmak et al., 2006; Wang et al., 2007]
Weight vector is constrained [Bai et al. 10]
Candidate pair $x = (x_1, x_2, x_3, x_4, \dots)$ and $y = (y_1, y_2, y_3, y_4, \dots)$, of lengths $d_1$ and $d_2$
Outer product:
$x \otimes y = (x_1 y_1,\; x_1 y_2,\; \dots,\; x_{d_1} y_{d_2})$, a vector of length $d_1 \times d_2$
Weight vector: $w = a \otimes b$
Low-Dimensional Reranking
Find $A$ and $B$ s.t. the projected score $(A^T x)^T (B^T y)$ ranks the correct candidate $y^*$ above the alternatives $y_1, y_2, y_3, \dots$

Low-Dimensional Reranking

Find $A$ and $B$ s.t.:
1. Score: $\operatorname{score}(x, y) = a^T x \; y^T b$ (per direction pair)
2. Add constraints to penalize incorrect candidates:
$\operatorname{score}(x, y^*) \ge \operatorname{score}(x, y) + 1 - \xi$, $\xi \ge 0$
Idea
[Tsochantaridis et al. 04]
Discriminative Model
α ← 0                                        // Initialization
Repeat
    (A, B) ← Softened-Disc(α)                // Get the current soln.
    m_i = ⟨A^T x, B^T y*⟩ − ⟨A^T x, B^T y_i⟩   // Compute margins
    ξ̂_i = 1 − m_i                            // Potential slack
    ξ_i = max(0, ξ̂_i)                        // Compute slack
    α_i ← α_i + η (ξ̂_i − ξ_i)                // Update the Lagrangian variables
Until convergence                            // i.e. the slack doesn't change
POS Tagging
Combine with Viterbi score
Interpolation parameter is tuned
Training
Input sentence and Reference tag sequences
Candidates, Score and Loss values
Testing: "Buyers stepped in to the futures pit ."

Score      Candidate tag sequence
-7.0514    NNS VBD IN TO DT NNS NN .
-0.1947    NNS VBD RP TO DT NNS NN .
-6.8068    NNS VBD RB TO DT NNS NN .
-7.1408    NNS VBD RP TO DT NNS VB .
-13.752    NNS VBD RB TO DT NNS VB .
POS Tagging
Combine with Viterbi score
Interpolation parameter is tuned
Data Statistics
Results
English Chinese French Swedish
Baseline 96.15 92.31 97.41 93.23
Collins 96.06 92.81 97.35 93.44
Regularized 96.00 92.88 97.38 93.35
Oracle 98.39 98.19 99.00 96.48
Softened-Disc    96.32  92.87  97.53  93.24
Discriminative   96.30  92.91  97.53  93.36
POS Tagging
The lesson in…
Interpolation with Viterbi score is crucial
Softened-Disc:
Independent of the no. of training examples
Easy to code and can be solved exactly

Improvement over the baseline:

                   English  Chinese  French  Swedish
Softened-Disc       +0.17    +0.56   +0.12    +0.01
Discriminative      +0.15    +0.60   +0.12    +0.13
Softened-Disc*      +0.92    +4.31   +1.12    +0.08
Discriminative*     +0.88    +4.77   +0.90    +0.73
Jagarlamudi and Daumé 2012
Similarity Search: Challenges
Computing nearest neighbors in high
dimensions using geometric search
techniques is very difficult
All methods are as bad as brute-force linear
search, which is expensive
Approximate techniques such as ANN perform
efficiently in dimensions as high as 20; in higher
dimensions, the results are rather spotty
Need to do search on commodity hardware
Cross-language search
What is the advantage?
Scales easily to very large databases
Compact language-independent representation
32 bits per object
Search is effective and efficient
Hamming nearest-neighbor search
Few milliseconds per query for searching a million objects (single thread on a single processor)
What is the challenge?
Language/Script Independent HashCodes
Learning Hash Functions from
Training Data
Learning Hash Functions: Summary
Given a set of parallel names as training data, find the top-K projection vectors for each language using Canonical Correlation Analysis.
Each projection vector gives a 1-bit hash function.
The hash code for a name is computed by projecting its feature vector onto the projection vectors, followed by binarization.
Udupa & Kumar, 2010
Fuzzy Name Search: Experimental Setup
Test Sets:
DUMBTIONARY: 1231 misspelled names
INTRANET: 200 misspelled names
Name Directories:
DUMBTIONARY: 550K names from Wikipedia
INTRANET: 150K employee names
Training Data: 15K pairs of single-token names in English and Hindi
Baselines:
Two popular search engines, Double Metaphone, BM25
Multilingual: Experimental Setup
Test Sets
1000 multi-word names each in Russian, Hebrew,
Kannada, Tamil, Hindi
Name Directory:
English Wikipedia Titles
6 Million Titles, 2 Million Unique Words
Baseline:
State-of-the-art Machine Transliteration (NEWS2009)
Dimensionality Reduction

(Chart: number of dimensionality reduction papers per year, 1990-2010, Vision vs. NLP)

(Chart: popularity of dimensionality reduction relative to Bayesian approaches, 1990-2010, Vision vs. NLP)
Summary
Dimensionality reduction has merits for NLP
Computational and Feature correlations
Has been explored in unsupervised fashion
But recent novel developments
For multi-view data
If you can formulate your problem as a mapping,
try dimensionality reduction
Can solve for the global optimum
Summary
Spectral Learning
Provides a way to learn the global optimum for generative models
Enriching the existing models
Using word embeddings instead of words
Scalability of the techniques
Doesn't depend on the number of samples
Large scale SVD
References

Jagadeesh Jagarlamudi and Hal Daumé III, Low-Dimensional Discriminative Reranking, in HLT-NAACL 2012.
Shaishav Kumar and Raghavendra Udupa, Learning Hash Functions for Cross-View Similarity Search, in IJCAI 2011.
Raghavendra Udupa and Shaishav Kumar, Hashing-based Approaches to Spelling Correction of Personal Names, in Proceedings of EMNLP 2010.
Raghavendra Udupa and Mitesh Khapra, Transliteration Equivalence using Canonical Correlation Analysis, in ECIR 2010.
Jagadeesh Jagarlamudi and Hal Daumé III, Regularized Interlingual Projections: Evaluation on Multilingual Transliteration, in Proceedings of EMNLP-CoNLL 2012.