Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations

Wai Lam, Ruizhang Huang, Pik-Shan Cheung

ACM SIGIR 2004

Introduction

Discovering translation pairs across different languages, especially for named entities (NEs)

Focusing on Chinese-English NE translation

Combining both phonetic and semantic information, whereas most previous studies dealt with a single type of evidence only

Matching Model

Segmenting NE candidates into tokens

Computing a token-to-token similarity score based on either phonetic information or semantic information

Treating the matching problem as a weighted bipartite matching

Finding the maximum weighted bipartite matching and using it as the similarity measure between two NE candidates
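
The matching step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the token-to-token scores are already collected in a matrix `sim` (rows for English tokens, columns for Chinese tokens), and it searches all assignments by brute force, which is only practical for the handful of tokens in a typical NE. The function name and the example scores are hypothetical.

```python
from itertools import permutations


def max_weight_matching(sim):
    """Return (total_score, pairs) of a maximum weighted bipartite matching.

    sim[i][j] is the similarity between English token i and Chinese token j.
    Brute force over assignments; fine for short token lists only.
    """
    if not sim or not sim[0]:
        return 0.0, []
    rows, cols = len(sim), len(sim[0])
    if rows > cols:
        # Work on the transpose so the smaller side is always assigned.
        transposed = [[sim[i][j] for i in range(rows)] for j in range(cols)]
        total, pairs = max_weight_matching(transposed)
        return total, [(i, j) for j, i in pairs]
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(cols), rows):
        total = sum(sim[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    return best_total, best_pairs


# Hypothetical scores: row 0 = "Palo Alto", row 1 = "Commerce";
# column 0 = "帕洛奧托", column 1 = "商".
score, pairs = max_weight_matching([[0.8, 0.0], [0.0, 0.5]])
print(score, pairs)
```

For larger inputs, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the brute-force loop.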

Tokenization & Semantic Similarity Score

Looking up each token of an English candidate in the bilingual dictionary provided by LDC.

Scanning the Chinese candidate to get those segments that can maximally match any of the Chinese translations in the dictionary

The semantic similarity score is defined as the number of matched characters divided by the total number of characters in the corresponding translation.

Unmatched English terms are concatenated with adjacent unmatched terms into one token, and so are unmatched Chinese segments.

For example:
– English candidate: Palo Alto Chamber of Commerce
– Chinese candidate: 帕洛奧托商會
– The Chinese translation of “commerce” is “商業”, and the segment “商” maximally matches this translation, so “商” is segmented as a token.
– The semantic similarity score between “commerce” and “商” is then Len(“商”) / Len(“商業”) = 1 / 2 = 0.5.
– The unmatched terms “Palo” and “Alto” are concatenated into one token.
– Likewise, the unmatched segment “帕洛奧托” is treated as a single token.
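
The score in this example can be reproduced with a short sketch. It assumes that “maximally match” means the longest contiguous segment of the Chinese candidate that also occurs in a dictionary translation; that reading, and the function names, are assumptions rather than details confirmed by the slides.

```python
def longest_matching_segment(candidate, translation):
    """Longest contiguous segment of `candidate` that occurs in `translation`."""
    best = ""
    for start in range(len(candidate)):
        # Only try segments longer than the best one found so far.
        for end in range(start + len(best) + 1, len(candidate) + 1):
            segment = candidate[start:end]
            if segment in translation:
                best = segment
            else:
                break  # extending a non-matching segment cannot match
    return best


def semantic_score(candidate, translations):
    """Return (segment, score) with score = matched chars / translation length."""
    best_segment, best_score = "", 0.0
    for translation in translations:
        if not translation:
            continue
        segment = longest_matching_segment(candidate, translation)
        score = len(segment) / len(translation)
        if score > best_score:
            best_segment, best_score = segment, score
    return best_segment, best_score


print(semantic_score("帕洛奧托商會", ["商業"]))  # → ('商', 0.5)
```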

Phonetic Similarity Score

Getting the phonetic representations of the English and Chinese candidates
– For example, “father” is transformed to “faDR”, and “港” is transformed to “gang3”.

Splitting the phonetic representations into basic phoneme units
– Note: There are some questions about the original paper.

Building a phoneme pronunciation similarity (PPS) table

Treating the problem as a weighted longest common subsequence problem

Finding the optimal longest common subsequence

Normalizing the score of the optimal solution by dividing it by the maximum length of the two sequences

Using the normalized score as the phonetic similarity score of the two representations
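
The weighted longest common subsequence step can be sketched with standard dynamic programming. The sketch assumes the PPS table is a dictionary keyed by (English phoneme, Chinese phoneme) pairs, with missing pairs scoring 0; the phoneme splits and table values in the usage lines are invented for illustration.

```python
def phonetic_similarity(a, b, pps):
    """Weighted LCS score of phoneme lists a and b, normalized by max length.

    pps maps (phoneme_from_a, phoneme_from_b) to a similarity weight.
    """
    if not a or not b:
        return 0.0
    n, m = len(a), len(b)
    # dp[i][j] = best weighted LCS score of a[:i] and b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            weight = pps.get((a[i - 1], b[j - 1]), 0.0)
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + weight)
    return dp[n][m] / max(n, m)


# Hypothetical phoneme units and PPS entries.
pps = {("g", "k"): 0.6, ("a", "a"): 1.0, ("ng", "ng"): 1.0}
print(phonetic_similarity(["g", "a", "ng"], ["k", "a", "ng"], pps))
```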

Learning Phonetic Similarity

Using 20,000 English-Chinese person name pairs from the C-E NE Corpus provided by LDC

The names are transformed into basic phoneme units through the procedure mentioned above.

The target is to maximize:

The Widrow-Hoff Algorithm

Minimize:

Zk is set to 1 for positive training samples and 0 for negative ones.

Processing one pair of entities at each iteration, and using the following equation to update the PPS table:

Using a validation set to implement the terminating condition
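
For orientation, here is the textbook Widrow-Hoff (least-mean-squares) update, offered as a sketch rather than as the paper's exact formula, which the slides do not preserve. It assumes the PPS table is flattened into a weight vector `v` and that each training pair is summarized by a feature vector `x` (a hypothetical encoding); `z` is the target described above (1 for positive samples, 0 for negative ones) and `eta` is the learning rate.

```python
def widrow_hoff_step(v, x, z, eta):
    """One textbook Widrow-Hoff update: v <- v - 2 * eta * (v.x - z) * x."""
    prediction = sum(vi * xi for vi, xi in zip(v, x))
    error = prediction - z
    return [vi - 2.0 * eta * error * xi for vi, xi in zip(v, x)]


# Hypothetical two-entry table and a positive sample (z = 1).
v = widrow_hoff_step([0.5, 0.5], [1.0, 0.0], z=1.0, eta=0.1)
print(v)
```

Each step moves the prediction toward the target in proportion to the current error, so entries that do not take part in a sample (zero features) are left unchanged.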

The Exponentiated-Gradient Algorithm

The top-level framework of EG is similar to that of WH.

EG requires V to belong to the probability simplex.

Therefore, during training, V is always kept on the probability simplex.

However, before being used to estimate similarity scores, the elements of V are magnified as:

And the updating formula is given by:
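
As a point of reference, the textbook exponentiated-gradient update for squared loss is sketched below. It is not taken from the slides, whose formulas are missing, and it omits the magnification step because its definition is not given in the text. As in any EG method, `v` is assumed to be a probability vector; `x`, `z`, and `eta` are a hypothetical feature vector, the 0/1 target, and the learning rate.

```python
import math


def eg_step(v, x, z, eta):
    """One textbook EG update: multiplicative step, then renormalization."""
    prediction = sum(vi * xi for vi, xi in zip(v, x))
    error = prediction - z
    unnormalized = [vi * math.exp(-2.0 * eta * error * xi) for vi, xi in zip(v, x)]
    total = sum(unnormalized)
    # Dividing by the total keeps the weights on the probability simplex.
    return [u / total for u in unnormalized]


v = eg_step([0.5, 0.5], [1.0, 0.0], z=1.0, eta=0.5)
print(v, sum(v))
```

Unlike the additive Widrow-Hoff step, the update is multiplicative, so weights stay positive and the renormalization keeps them summing to 1.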

The Genetic Algorithm

A chromosome represents all the elements in the PPS table.

Each gene in a chromosome corresponds to a particular element in the table.

An initial population of chromosomes is prepared.

Standard genetic operators such as crossover and mutation are employed.

The target is to maximize:
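
A generic sketch of the genetic search is given below. The chromosome is a flat list of PPS entries, as described above; single-point crossover, uniform mutation, tournament selection, and elitism are assumptions chosen for illustration, and the fitness function in the usage lines is a placeholder because the paper's objective is not preserved in the slides. The default rates are the values reported in Experiment 1.

```python
import random


def crossover(p1, p2, rng, rate=0.8):
    """Single-point crossover applied with probability `rate`."""
    if len(p1) > 1 and rng.random() < rate:
        point = rng.randrange(1, len(p1))
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1[:], p2[:]


def mutate(chromosome, rng, rate=0.0001):
    """Replace each gene by a random value in [0, 1) with probability `rate`."""
    return [rng.random() if rng.random() < rate else g for g in chromosome]


def evolve(population, fitness, rng, generations=20, cx_rate=0.8, mut_rate=0.0001):
    """Evolve a population (size >= 2); the best chromosome always survives."""
    pop = [list(c) for c in population]
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)[:]]  # elitism
        while len(next_pop) < len(pop):
            # Binary tournament selection for each parent.
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = crossover(a, b, rng, cx_rate)
            next_pop.append(mutate(c1, rng, mut_rate))
            if len(next_pop) < len(pop):
                next_pop.append(mutate(c2, rng, mut_rate))
        pop = next_pop
    return max(pop, key=fitness)


rng = random.Random(0)
population = [[rng.random() for _ in range(4)] for _ in range(6)]
placeholder_fitness = lambda c: -sum((g - 0.5) ** 2 for g in c)  # hypothetical
print(evolve(population, placeholder_fitness, rng))
```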

Experiments

Experiment 1:
– 2,000 person name pairs, different from the training and validation data, are used to evaluate the learning performance.
– The learning rate of the WH algorithm is set to 5 × 10^-5.
– The learning rate of the EG algorithm is set to 0.01.
– The crossover rate and mutation rate of the genetic algorithm are tuned to 0.8 and 0.0001, respectively.

Experiment 2:
– 1,000 named entities are collected to evaluate the performance of the overall NE matching model.
– The pure phonetic model and the pure semantic model are also evaluated for comparison.

Mining New Entity Translations

An unsupervised learning technique using a bilingual dictionary is employed to detect comparable news.

Person names, place names, and organization names are automatically extracted from the news content.

For each NE, its cognate weight is computed, which represents the NE’s importance in the news cluster.

Both the NE matching model and the cognate weight are used to discover new translations.


If both the English and the Chinese named entity have relatively high cognate weights in a particular news cluster, they are more likely to be a match.

The formula for measuring the cognate weight similarity score, Sw(E,C), is defined as follows:

The final similarity score Sf (E,C) for E and C is given as follows:


News articles from 20 November 2003 to 20 December 2003 were collected.

There are 1,599 English news articles and 2,476 Chinese news articles in total.

Each news batch contains news from four consecutive days, resulting in 28 batches in total.

Comparable news clusters are generated for each batch.

αh was set to 0.8, and the threshold φ was set to 0.5.
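
The batch count reported above is consistent with overlapping four-day windows that advance one day at a time over the 31-day collection period (31 − 4 + 1 = 28). The slides do not state the windowing scheme explicitly, so the sketch below treats it as an assumption; the function name is hypothetical.

```python
from datetime import date, timedelta


def sliding_batches(start, end, width):
    """Return (first_day, last_day) for every `width`-day window in [start, end]."""
    n_days = (end - start).days + 1
    return [
        (start + timedelta(days=i), start + timedelta(days=i + width - 1))
        for i in range(n_days - width + 1)
    ]


batches = sliding_batches(date(2003, 11, 20), date(2003, 12, 20), 4)
print(len(batches))  # → 28
```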


There are in total 128 unseen name translations discovered.

Suppose we consider only those discovered Chinese NEs whose corresponding English entity appeared in the output.

The average ARR across all 28 days for all the named entities was 0.960.

The ARR for person names was 0.952 and the ARR for place and organization names was 0.984.
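
The metric can be illustrated with a short sketch, assuming ARR stands for average reciprocal rank (the slides do not expand the abbreviation): for each discovered entity, take 1 divided by the rank at which the correct translation appears, then average. The ranks in the usage line are invented.

```python
def average_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries; ranks are 1-based positions."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct translation for three entities.
print(average_reciprocal_rank([1, 1, 2]))
```

Under this reading, an ARR close to 1 means the correct translation is ranked first for nearly every entity.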
