


Computer Speech and Language 21 (2007) 594–608

www.elsevier.com/locate/csl


Sentence alignment using P-NNT and GMM

Mohamed Abdel Fattah a,*, David B. Bracewell a, Fuji Ren a,b, Shingo Kuroiwa a

a Faculty of Engineering, University of Tokushima, 2-1 Minamijosanjima, Tokushima 770-8506, Japan
b School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing 100088, China

Received 4 January 2006; received in revised form 25 December 2006; accepted 24 January 2007
Available online 4 February 2007

Abstract

Parallel corpora have become an essential resource for work in multilingual natural language processing. However, sentence-aligned parallel corpora are more efficient than non-aligned parallel corpora for cross-language information retrieval and machine translation applications. In this paper, we present two new approaches to align English–Arabic sentences in bilingual parallel corpora based on probabilistic neural network (P-NNT) and Gaussian mixture model (GMM) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. A set of manually prepared training data was used to train the probabilistic neural network and Gaussian mixture model. Another set of data was used for testing. Using the probabilistic neural network and Gaussian mixture model approaches, we achieved error reductions of 27% and 50%, respectively, over the length-based approach when applied to a set of parallel English–Arabic documents. In addition, the results of P-NNT and GMM outperform those of the combined model, which exploits length, punctuation and cognates in a dynamic framework. The GMM approach also outperforms Melamed's and Moore's approaches. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain more, fewer or different features than the ones used in the current research, such as a lexical matching feature and Hanzi characters in Japanese–Chinese texts.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Sentence alignment; English/Arabic parallel corpus; Parallel corpora; Probabilistic neural network; Gaussian mixture model

1. Introduction

Recent years have seen a great interest in bilingual corpora that are composed of a source text along with a translation of that text in another language. Nowadays, bilingual corpora have become an essential resource for work in multilingual natural language processing systems (Moore, 2002; Gey et al., 2002; Davis and Ren, 1998), including data-driven machine translation (Dolan et al., 2002), bilingual lexicography, automatic translation verification, automatic acquisition of knowledge about translation (Simard, 1999), and cross-language information retrieval (Chen and Gey, 2001; Oard, 1997). It is required that the bilingual corpora be aligned. Given a text and its translation, an alignment is a segmentation of the two texts such that the nth segment of one text is the translation of the nth segment of the other (as a special case, empty segments are allowed, and

0885-2308/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.csl.2007.01.002

* Corresponding author. Tel.: +81 088 625 1545.
E-mail address: mohafi@is.tokushima-u.ac.jp (M.A. Fattah).


either corresponds to translator's omissions or additions) (Simard, 1999; Christopher and Kar, 2004). With aligned sentences, further analysis such as phrase and word alignment analysis (Fattah et al., 2006a; Ker and Chang, 1997; Melamed, 1997), and bilingual terminology and collocation extraction analysis, can be performed (Dejean et al., 2002; Thomas and Kevin, 2005).

In the last few years, much work has been reported in sentence alignment using different techniques. Length-based approaches [length as a function of sentence characters (Gale and Church, 1993) or sentence words (Brown et al., 1991)] are based on the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. These approaches work quite well with a clean input, such as the Canadian Hansards corpus, whereas they do not work well with noisy document pairs (Thomas and Kevin, 2005). Cognate-based approaches were also proposed and combined with the length-based approach to improve the alignment accuracy (Simard et al., 1992; Melamed, 1999; Danielsson and Muhlenbock, 2000; Ribeiro et al., 2001).

Sentence cognates such as digits, alphanumerical symbols, punctuation, and alphabetical words have been used. However, all cognate-based approaches are tailored to closely related Western language pairs. For disparate language pairs, such as Arabic and English, which lack a shared Roman alphabet, it is not possible to rely on the aforementioned cognates to achieve high-precision sentence alignment of noisy parallel corpora (however, cognates may be efficient when used with some other approaches). Some other sentence alignment approaches are text-based approaches, such as the hybrid dictionary approach (Collier et al., 1998), part-of-speech alignment (Chen and Chen, 1994), and the lexical method (Chen, 1993). While these methods require little or no prior knowledge of source and target languages and give good results, they are relatively complex and require significant amounts of parallel text and language resources.

Instead of a one-to-one hard matching of punctuation marks in parallel texts, as used in the cognate approach of Simard et al. (1992), Thomas and Kevin (2005) allowed no match and one-to-several matching of punctuation marks. However, neither Simard nor Thomas took into account the text length between two successive cognates (Simard's case) or punctuation marks (Thomas's case), which increased the system confusion and led to an increase in execution time and a decrease in accuracy. We have avoided this drawback by taking the probability of the text length between successive punctuation marks into account during the punctuation matching process, as will be shown in the following sections.

In this paper, we present non-traditional approaches for English–Arabic sentence alignment. For sentence alignment, we may have a 1–0 match, where one English sentence does not match any of the Arabic sentences, or a 0–1 match, where one Arabic sentence does not match any English sentence. The other matches we focus on are 1–1, 1–2, 2–1, 2–2, 1–3 and 3–1.

There may be more categories in bi-texts, but they are rare. Therefore, we consider only the previously mentioned categories. If the system finds any other categories, they will automatically be misclassified. As illustrated above, we have eight sentence alignment categories. As such, we can consider sentence alignment as a classification problem, which may be solved by using probabilistic neural network or Gaussian mixture model classifiers.

The paper is organized as follows: Section 2 introduces English–Arabic text features. Section 3 illustrates our new approaches. Section 4 discusses English–Arabic corpus creation. Section 5 shows the experimental results. Finally, Section 6 gives concluding remarks and discusses future work.

2. English–Arabic text features

Many features can be extracted from any text pair. The most important feature is the text length, since Gale and Church achieved good results using this feature. They used the fact that "longer sentences in one language tend to be translated into longer sentences in the other language, and shorter sentences tend to be translated into shorter sentences".

The second text feature is punctuation marks. When presenting the same content in two different languages, translators exhibit a strong tendency to use the same punctuation structure in the text pairs as much as possible. Take the following example, taken from the set of United Nations documents, to illustrate this fact:

Some 40% indicated the programme of work was under review, while 47% had not reviewed it.


Table 1
Matching punctuation marks

English punctuation mark    %    ,    %    .
Arabic punctuation mark     %    ‘    %    .


Table 1 shows the punctuation marks that match between the previous two sentences. This example shows a perfect matching, which does not occur in all English–Arabic document pairs. Some English punctuation marks are not matched with any Arabic punctuation mark, and vice versa. Hence we can classify punctuation matching into the following categories:

(A) 1–1 matching type, where one English punctuation mark matches one Arabic punctuation mark.
(B) 1–0 matching type, where one English punctuation mark does not match any of the Arabic punctuation marks.
(C) 0–1 matching type, where one Arabic punctuation mark does not match any of the English punctuation marks.

It is also clear from Table 1 that some English punctuation marks have a different shape when they are used in Arabic; for example, the English comma "," has the form "‘" in Arabic. These punctuation marks can be used to construct a probabilistic model to align English–Arabic document pairs. However, not all English–Arabic sentence pairs contain punctuation marks. Hence, if we use punctuation marks as the only text feature, the recall may be very low. It is more convenient to use punctuation in addition to other text features in order to have both good precision and recall.

The probability that a sequence of punctuation marks APi = Ap1 Ap2 … Api in an Arabic language text translates to a sequence of punctuation marks EPj = Ep1 Ep2 … Epj in an English language text is P(APi, EPj). The system searches for the punctuation alignment that maximizes the probability over all possible alignments, given a pair of punctuation sequences corresponding to a pair of parallel sentences, from the following formula:

arg max_AL P(AL | APi, EPj),    (1)

where AL is a punctuation alignment. Assume that the probabilities of the individually aligned punctuation pairs are independent. After applying Bayes' rule, the following formula may be considered:

P(APi, EPj) = ∏_AL P(Apk, Epk),    (2)

where P(Apk, Epk) is the probability of matching Apk with Epk, and it may be calculated as follows:

P(Apk, Epk) = (number of occurrences of the punctuation pair (Apk, Epk)) / (total number of punctuation pairs in the manually aligned data).
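As an illustrative sketch (not the authors' code), the relative-frequency estimate above can be computed from a list of hand-aligned punctuation mark pairs; the tuple input format is an assumption:

```python
from collections import Counter

def punctuation_pair_probs(aligned_pairs):
    """Estimate P(Apk, Epk) as relative frequencies over manually
    aligned (Arabic mark, English mark) pairs."""
    counts = Counter(aligned_pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Toy data: the Arabic comma (U+060C) aligned twice with the English comma.
probs = punctuation_pair_probs([("\u060C", ","), ("\u060C", ","),
                                (".", "."), (":", ":")])
```

In practice the counts would come from the 1000 manually aligned sentence pairs mentioned below.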

Applying Eq. (2) without taking the length between successive punctuation marks into consideration may cause confusion for the system and will decrease the accuracy of matching. Take the following English sentence and its Arabic translation (taken from one UN document pair) as an example to illustrate this point:

Many species, including, inter alia, river dolphins and porpoises, freshwater seals, manatees, hippopotamuses, the Asian water buffalo, otters, the European mink, the fishing cat and the flat-headed cat, the desmans (Desmana moschata, or Russian desman, and the Pyrenean desman (Galemys pyrenaicus)) and the well-known semi-aquatic beavers, are threatened or endangered, mainly from habitat loss and degradation, pollution, overexploitation or entrapment in nets and other fishing gear.


In the above example, if we directly match punctuation marks without taking the length between them into account, the second English "," will be matched with the second Arabic ",", which is not correct. However, if we take the length between punctuation marks into account, we will see that the text length between the first and the second English "," is much smaller than the text length between the first and the second Arabic ",". Therefore, we add the term P(dk | match) to Eq. (2) in order to take the effect of the text length between successive punctuation marks into account, which yields Eq. (3):

P(APi, EPj) = ∏_AL P(Apk, Epk) · P(dk | match),    (3)

where P(dk | match) is the length-related probability distribution function, and dk is a function of the text length of the source language (the text length between punctuation marks Epk and Epk−1) and the text length of the target language (the text length between punctuation marks Apk and Apk−1).

P(dk | match) is derived directly from Gale and Church (1993), using the normality hypothesis, as follows:

P(dk | match) = 2(1 − P(|dk|)),
P(dk) = (1/√(2π)) ∫_{−∞}^{dk} e^{−z²/2} dz,
dk = (le − la · c) / √(la · s²),

where le and la are the character lengths of the two portions of text under consideration (Gale and Church, 1993). The parameter c is the expected number of characters in le per character in la, and it is determined empirically from 1000 English–Arabic sentence pairs extracted from manually aligned UN (United Nations) documents. We estimated c from the following equation:

c = (number of characters in English paragraphs) / (number of characters in Arabic paragraphs) = 1.12.

s² is the variance of the number of characters in le per character in la, and it is estimated from the relation between the square of the paragraph length difference and the Arabic paragraph length. The model assumes that s² is proportional to length; the constant of proportionality is determined by the slope of a robust regression line. The result for English–Arabic is s² = 22.9. We manually aligned the punctuation marks in 1000 English–Arabic sentence pairs to calculate each punctuation mark pair probability. The following is an example of one English–Arabic punctuation mark pair probability:

Epk    Apk    Frequency    Probability
,      ‘      1121         0.186
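The length term P(dk | match) of Eq. (3) can be sketched as follows, plugging in the paper's estimates c = 1.12 and s² = 22.9; the function name is illustrative:

```python
import math

C = 1.12   # expected English characters per Arabic character (paper's estimate)
S2 = 22.9  # variance estimate (paper's estimate)

def length_match_prob(le, la, c=C, s2=S2):
    """P(dk | match) = 2 * (1 - Phi(|dk|)), where Phi is the standard
    normal CDF and dk = (le - la * c) / sqrt(la * s2)."""
    dk = (le - la * c) / math.sqrt(la * s2)
    phi = 0.5 * (1.0 + math.erf(abs(dk) / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - phi)
```

When le is exactly c times la, dk = 0 and the term takes its maximum value of 1; the further the lengths diverge, the smaller it becomes.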

After specifying the punctuation alignment that maximizes the probability over all possible alignments, given a pair of punctuation sequences (using a dynamic programming framework as in Gale and Church (1993)), the system calculates the punctuation compatibility factor for the text pair under consideration as follows:

γ = c / max(m, n),

where
γ is the punctuation compatibility factor,
c is the number of direct punctuation matches,
n is the number of Arabic punctuation marks,
m is the number of English punctuation marks.
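A minimal sketch of the compatibility factor, with illustrative argument names:

```python
def compatibility_factor(direct_matches, n_arabic_marks, n_english_marks):
    """gamma = c / max(m, n): direct punctuation matches relative to the
    longer of the two punctuation sequences."""
    denom = max(n_arabic_marks, n_english_marks)
    return direct_matches / denom if denom else 0.0
```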

The punctuation compatibility factor is considered as the second text pair feature.

The third text pair feature is the cognate. For disparate language pairs, such as Arabic and English, that lack a shared alphabet, it is not possible to rely only on cognates to achieve high-precision sentence alignment of noisy parallel corpora.

However, many UN and scientific Arabic documents contain some English words and expressions. These words may be used as cognates. Take the previous example (the English–Arabic sentence pair example) to illustrate this fact. In the previous example, the words "otters", "Galemys" and "pyrenaicus" were used in the


Arabic sentence as they have no translation. These words may be used as cognates. We define the cognate factor (cog) as the number of common items in the sentence pair. For instance, in the above example, the cognate factor is cog = 3. When a sentence pair has no cognate words, the cognate factor is 0.
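Counting shared items gives the cognate factor directly; a sketch assuming the tokens have already been extracted from each sentence:

```python
def cognate_factor(arabic_tokens, english_tokens):
    """Number of distinct items (e.g. Latin-script words, digits) that the
    sentence pair has in common; 0 when there is no overlap."""
    return len(set(arabic_tokens) & set(english_tokens))

# The species example above shares "otters", "Galemys" and "pyrenaicus".
cog = cognate_factor(["otters", "Galemys", "pyrenaicus", "XYZ"],
                     ["otters", "Galemys", "pyrenaicus", "beavers"])
```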

3. The proposed sentence alignment model

The classification framework of the proposed sentence alignment model has two modes of operation. The first is training mode, where features are extracted from 7653 manually aligned English–Arabic sentence pairs and used to train a probabilistic neural network (P-NNT) and a Gaussian mixture model (GMM). The second is testing mode, where features are extracted from the testing data and aligned using the previously trained models. Alignment is done using a block of 3 sentences for each language. After aligning a source language sentence and target language sentence, the next 3 sentences are then looked at. We have used 18 input units and 8 output units for the P-NNT and GMM. Each input unit represents one input feature. The input feature vector X is as follows:

X = [ L(S1a)/L(S1e),
      (L(S1a) + L(S2a))/L(S1e),
      L(S1a)/(L(S1e) + L(S2e)),
      (L(S1a) + L(S2a))/(L(S1e) + L(S2e)),
      (L(S1a) + L(S2a) + L(S3a))/L(S1e),
      L(S1a)/(L(S1e) + L(S2e) + L(S3e)),
      γ(S1a, S1e), γ(S1a, S2a, S1e), γ(S1a, S1e, S2e), γ(S1a, S2a, S1e, S2e),
      γ(S1a, S2a, S3a, S1e), γ(S1a, S1e, S2e, S3e),
      Cog(S1a, S1e), Cog(S1a, S2a, S1e), Cog(S1a, S1e, S2e), Cog(S1a, S2a, S1e, S2e),
      Cog(S1a, S2a, S3a, S1e), Cog(S1a, S1e, S2e, S3e) ],

where L(X) is the length in characters of sentence X; γ(X, Y) is the punctuation compatibility factor between sentences X and Y; and Cog(X, Y) is the cognate factor between sentences X and Y.

The output is one of 8 categories, specified as follows. S1a → 0 means that the first Arabic sentence has no English match. S1e → 0 means that the first English sentence has no Arabic match. Similarly, the remaining outputs are S1a → S1e, S1a + S2a → S1e, S1a → S1e + S2e, S1a + S2a → S1e + S2e, S1a → S1e + S2e + S3e and S1a + S2a + S3a → S1e.
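A sketch of how the 18-dimensional vector could be assembled for a 3-sentence block per language; the punctuation-compatibility and cognate scorers are passed in as assumed callables rather than reproduced here:

```python
def feature_vector(a_sents, e_sents, gamma, cog):
    """Build the 18 features: six length ratios over the sentence groupings
    listed above, followed by six gamma scores and six cognate scores.
    gamma(a_group, e_group) and cog(a_group, e_group) take lists of
    sentences (their implementations are assumed to be supplied)."""
    L = len
    a1, a2, a3 = a_sents
    e1, e2, e3 = e_sents
    groups = [
        ([a1], [e1]), ([a1, a2], [e1]), ([a1], [e1, e2]),
        ([a1, a2], [e1, e2]), ([a1, a2, a3], [e1]), ([a1], [e1, e2, e3]),
    ]
    # Character-length ratios, guarded against empty English groups.
    ratios = [sum(L(s) for s in ag) / max(1, sum(L(s) for s in eg))
              for ag, eg in groups]
    return (ratios
            + [gamma(ag, eg) for ag, eg in groups]
            + [cog(ag, eg) for ag, eg in groups])
```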

3.1. Probabilistic neural network

Probabilistic neural networks are a versatile and efficient tool for classifying high-dimensional data (Specht, 1990; Ganchev et al., 2003; Cain, 1990). The network weights and functions are backed by straightforward Bayesian probability, giving them an edge over other network models that have to be gradually optimized using techniques like gradient descent. Bayes' theorem can be used to perform probabilistically optimal classification as follows:

The probability distribution function (PDF) for a feature vector X to belong to a certain category (say class A, one of the 8 output categories) is given by

fa(X) = [1/((2π)^(p/2) σ^p)] (1/na) ∑_{i=1}^{na} exp(−(X − Yai)ᵀ(X − Yai)/(2σ²)),    (4)

where
fa(X) is the value of the PDF for class A at point X,
X is the test vector to be classified,
i is the training vector number,
p is the training vector size,
na is the number of training vectors in class A,


Yai is the ith training vector for class A,
ᵀ denotes the transpose,
σ is the standard deviation of the Gaussian curves used to construct the PDF.

Introducing a term to represent the relative number of trials in each category (na/ntotal) simplifies the expression, cancelling the (1/na) term:

fa(X) = [1/((2π)^(p/2) σ^p)] (1/ntotal) ∑_{i=1}^{na} exp(−(X − Yai)ᵀ(X − Yai)/(2σ²)).    (5)

Terms common to all classes, such as 1/(2π)^(p/2), σ^p and ntotal, can also be eliminated, leaving the following formula:

fa(X) ∝ ∑_{i=1}^{na} exp(−(X − Yai)ᵀ(X − Yai)/(2σ²)).    (6)

Hence the classifier can be expressed as follows. For a feature vector X to belong to category r, the following condition must hold:

∑_i exp(−(X − Yri)ᵀ(X − Yri)/(2σ²)) ≥ ∑_i exp(−(X − Ysi)ᵀ(X − Ysi)/(2σ²)),    (7)

where s ranges over all other categories. Since the feature vectors are normalized to unit length, (X − Yri)ᵀ(X − Yri) = XᵀX − 2XᵀYri + YriᵀYri = 1 − 2XᵀYri + 1 = 2 − 2XᵀYri, allowing formula (7) to be simplified as follows:

∑_i exp((XᵀYri − 1)/σ²) ≥ ∑_i exp((XᵀYsi − 1)/σ²).    (8)

Fig. 1 shows the structure of the P-NNT implementation. Each neuron in layer one receives an element of the vector X to be classified as input (x1, x2, …, xn) (n = 18 in our case). The weight matrix scaling these inputs is formed by the elements of the training vectors divided by the constant σ². The first layer has a bias of −1/σ². The inputs of layer one are summed, producing (XᵀYri − 1). Then, this value is divided by σ² and the

Fig. 1. The structure of P-NNT implementation.


exponential transfer function is applied, resulting in outputs of exp((XᵀYri − 1)/σ²) and exp((XᵀYsi − 1)/σ²), where s represents the remaining categories. The second layer has eight neurons, each one associated with one of the output categories mentioned before. The inputs from the first layer of each category are summed to produce the expressions ∑_{i=1}^{m} exp((XᵀYri − 1)/σ²) and ∑_{i=1}^{m} exp((XᵀYsi − 1)/σ²). The output of each neuron represents the probability that the vector X belongs to its class. The neuron in layer 2 with the greatest output determines how the vector is classified.
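Under the unit-length normalization that Eq. (8) relies on, the whole two-layer network reduces to a few lines; this is a sketch, not the authors' implementation:

```python
import numpy as np

def pnn_classify(x, class_vectors, sigma):
    """Eq. (8): pick the class r maximizing sum_i exp((x . y_ri - 1) / sigma^2).
    x and every training vector are assumed to be unit-normalized.
    class_vectors maps a label to an (n_i, d) array of training vectors."""
    scores = {label: np.exp((Y @ x - 1.0) / sigma ** 2).sum()
              for label, Y in class_vectors.items()}
    return max(scores, key=scores.get)
```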

3.2. Gaussian mixture model

The use of Gaussian mixture models as a classification tool is motivated by the interpretation that the Gaussian components represent some general output-dependent features, and by the capability of Gaussian mixtures to model arbitrary densities (Reynolds, 1995; Pellom and Hansen, 1998; Fattah et al., 2006b).

The probability density function for a certain class (category) feature vector X is a weighted sum, or mixture, of K class-conditional Gaussian distributions. For a given class model λc, the probability of observing X is given by

p(X | λc) = ∑_{k=1}^{K} w_{c,k} N(X; μ_{c,k}, Σ_{c,k}),    (9)

where w_{c,k}, μ_{c,k} and Σ_{c,k} are the mixture weight, mean, and covariance matrix, respectively, for the kth component, which has a Gaussian distribution given by

N(X; μ, Σ) = [1/√((2π)^n |Σ|)] exp(−(1/2)(X − μ)ᵀ Σ⁻¹(X − μ)),    (10)

where n is the dimension of X. We used diagonal covariance matrices Σ. Given a set of training vectors of a certain class, an initial set of means is estimated using k-means clustering. The mixture weights, means, and covariances are then iteratively trained using the expectation maximization (EM) algorithm.

Using this approach, we constructed a class-dependent model for each category. After that, we used all the models for the sentence alignment task, choosing the category with the maximum likelihood as follows: for a given set of class-dependent reference models (λ1, λ2, …, λ8) and one feature vector X = {x1, x2, …, xn}, the minimum-error Bayes' decision rule is

arg max_{1≤l≤8} p(λl | X) = arg max_{1≤l≤8} [p(X | λl) / p(X)] · p(λl).    (11)

Using formula (11), a feature vector sequence X may be classified as one of the eight classes.
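A sketch of the diagonal-covariance mixture likelihood of Eqs. (9) and (10) and the Bayes decision of Eq. (11); the parameter layout (weights, means, variances per class) is an assumption, and a fitted model would normally come from the k-means initialization plus EM described above:

```python
import numpy as np

def diag_gmm_loglik(x, weights, means, variances):
    """log p(x | lambda_c) for a diagonal-covariance Gaussian mixture:
    weights (K,), means (K, d), variances (K, d)."""
    d = x.shape[0]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    log_expo = -0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    return np.log(np.exp(np.log(weights) + log_norm + log_expo).sum())

def gmm_classify(x, class_models, priors):
    """Eq. (11): argmax over classes of p(x | lambda_l) * p(lambda_l)."""
    return max(class_models,
               key=lambda c: diag_gmm_loglik(x, *class_models[c]) + np.log(priors[c]))
```

With equal priors the rule reduces to picking the class whose mixture assigns the feature vector the highest likelihood.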

4. English–Arabic corpus

Although there are very popular Arabic–English resources in the statistical machine translation community, which may be found in projects such as the DARPA TIDES program (http://www.ldc.upenn.edu/Projects/TIDES/), we decided to construct our Arabic–English parallel corpus from the Internet in order to have significant parallel data from different domains.

Arabic text is far behind on the Web's exponential growth curve. Arabic text (as opposed to images) did not really start emerging on the Web until the release of Microsoft Windows 98™, which provided Arabic support in its version of Internet Explorer. Unlike some language pairs, such as English–Japanese, where you can find thousands of parallel articles, English–Arabic parallel documents are not easy to find on the Internet. Hence we were forced to generate them from the Internet (especially using UN documents).

When presenting the same content in two different languages, translators exhibit a very strong tendency to use the same document structure. Resnik (Resnik and Smith, 2003), building on this idea, used a technique called STRAND (structural translation recognition, acquiring natural data) to collect parallel text from the Internet archive. The drawback of this approach is that many significant parallel documents that exist on


the Internet are not of the same structure. We avoided this problem when we extracted English/Arabic bitexts from the Internet archive as follows:

Three steps are required to find parallel documents:

(1) Locating the pages that might contain parallel documents,
(2) Generating the document pairs that might be translations of each other,
(3) Filtering out non-translation candidate pairs.

Unlike Resnik and Smith (2003), who used the AltaVista search engine, Kraaij et al. (2003) used more than one search engine, and their mining process is divided into two main steps: identification of candidate parallel pages, and verification of their parallelism. We also used different Internet search engines to collect bi-texts from different domains. We simply sent queries to the different Internet search engines. These queries contained words like "Arabic version", "English version", "Arabic", "English", "To Arabic", and "To English", in order to download pages that might contain English/Arabic parallel documents. Then we collected two types of pages:

(a) A parent page is one that contains hypertext links to different-language versions of a document.
(b) A sibling page is a page in one language that itself contains a link to a version of the same page in another language.

For each page, we download all the files in the English page and all the files in the Arabic page as well. According to the document names, for example, an Arabic file called exp_a38 is most probably a translation of the English file called exp_e38. We gathered the English–Arabic document pairs that are most probably parallel. A length filter is used to filter out the non-parallel document pairs. It is assumed that very long documents will not be translations of very short documents, and vice versa. To determine quantitative parameters for the length filter, 50 pairs of manually created parallel documents of varying lengths were used. The word counts of these documents showed that the length of the Arabic versions of the documents varied between 0.8 and 1.2 times the length of the English versions. We collected 652 document pairs that contain 191,623 English sentences and 183,542 Arabic sentences. This collected corpus contains noisy text. Moreover, the beginning and end of each paragraph are not easily identified in English and Arabic texts. In order to avoid accumulation of error during the sentence alignment procedure, we specified some anchor points in the English and Arabic texts, based on words or symbols that appeared at the beginning of some sentences and had to be reasonably frequent. We manually checked 15,652 English–Arabic sentences to extract a list of anchors. A sample of English–Arabic anchor pairs is shown in Table 2.
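The 0.8-1.2 word-count filter described above can be sketched as:

```python
def probably_parallel(arabic_word_count, english_word_count, lo=0.8, hi=1.2):
    """Keep a candidate document pair only if the Arabic word count is
    0.8-1.2 times the English word count (the paper's empirical bounds)."""
    if english_word_count == 0:
        return False
    return lo <= arabic_word_count / english_word_count <= hi
```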

Usually, combinations of anchor pairs exist. For instance, the English anchor (Article 5) and the corresponding Arabic anchor ( ) may appear in the parallel corpus. Some symbols may be deleted to obtain equivalent anchor pairs. For example, for the English anchor (Q 13:) to be equivalent to the Arabic anchor (: ), we should delete the character (Q).

All of the English and Arabic texts in the corpus were aligned using the most frequent anchors appearing at the beginning of some sentences, according to the following algorithm:

Table 2
A sample of English–Arabic anchor pairs

English anchor
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
I, II, III, IV, V, VI, VII
(a), (b), …
A, B, C, …
Article
Decision
Paragraph


RA: Read(Arabic sentence)
    If (beginning of the sentence is not an anchor)
        GOTO RA
    Endif
RE: Read(English sentence)
    If (beginning of the sentence is not an anchor)
        GOTO RE
    Endif
    If ((Arabic anchor is equivalent to English anchor) &&
        (0.8 < (Arabic length between successive anchors) /
               (English length between successive anchors) < 1.2))
        Specify the English and Arabic anchors as a paragraph start
    Endif
    If (not end of Arabic and English files)
        GOTO RA
    Endif
END
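The anchor-matching loop above might be rendered as follows; the is_anchor and equivalent predicates, and the use of character counts for the inter-anchor lengths, are assumptions:

```python
def align_by_anchors(arabic_sents, english_sents, is_anchor, equivalent,
                     lo=0.8, hi=1.2):
    """Scan both texts for anchor-initial sentences and mark equivalent
    anchor pairs with a plausible inter-anchor length ratio as paragraph
    starts. Returns (arabic_index, english_index) pairs."""
    starts = []
    ai = ei = prev_a = prev_e = 0
    while ai < len(arabic_sents) and ei < len(english_sents):
        while ai < len(arabic_sents) and not is_anchor(arabic_sents[ai]):
            ai += 1                                   # RA loop
        while ei < len(english_sents) and not is_anchor(english_sents[ei]):
            ei += 1                                   # RE loop
        if ai >= len(arabic_sents) or ei >= len(english_sents):
            break
        # Character counts since the previous anchor (guard against zero).
        a_len = sum(len(s) for s in arabic_sents[prev_a:ai]) or 1
        e_len = sum(len(s) for s in english_sents[prev_e:ei]) or 1
        if equivalent(arabic_sents[ai], english_sents[ei]) and lo < a_len / e_len < hi:
            starts.append((ai, ei))
        prev_a, prev_e = ai, ei
        ai += 1
        ei += 1
    return starts
```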

We may consider equivalent anchor pairs as the start of a paragraph. The final English–Arabic corpus became like the Canadian Hansards corpus, in that its paragraphs are easily aligned.

5. Experimental results

5.1. Length based approach

Let us consider Gale and Church's length-based approach. We constructed a dynamic programming framework to conduct experiments using their length-based approach as a baseline to compare with our proposed system. First of all, it was not clear to us which variable should be considered as the text length: characters or words? To answer this question, we performed some statistical measurements on 1000 manually aligned English–Arabic sentence pairs, randomly selected from the previously mentioned corpus. We considered the relationship between English paragraph length and Arabic paragraph length as a function of the number of words. The results showed that there is a good correlation (0.987) between English paragraph length and Arabic paragraph length. Moreover, the ratio and corresponding standard deviation were 0.9826 and 0.2046, respectively. We also considered the relationship between English paragraph length and Arabic paragraph length as a function of the number of characters.

The results showed a better correlation (0.992) between English paragraph length and Arabic paragraph length; the ratio and corresponding standard deviation were 1.12 and 0.1806, respectively. Compared to the word-based results, the number of characters is the better text-length variable, since the correlation is higher and the standard deviation is lower.
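The correlation and ratio statistics above can be computed as follows. This is a sketch under the assumption that paired paragraph lengths are available as two lists; the `length_stats` helper is ours, not part of the original experiments.

```python
import numpy as np

def length_stats(src_lens, tgt_lens):
    """Correlation between paired text lengths, plus the mean and sample
    standard deviation of the per-pair length ratio (tgt/src), as used to
    choose characters vs. words as the length variable."""
    src = np.asarray(src_lens, dtype=float)
    tgt = np.asarray(tgt_lens, dtype=float)
    corr = np.corrcoef(src, tgt)[0, 1]   # Pearson correlation
    ratio = tgt / src                    # per-pair length ratio
    return corr, ratio.mean(), ratio.std(ddof=1)
```

Run once with word counts and once with character counts, the variable giving the higher correlation and lower ratio deviation is the better length measure.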

We applied the length based approach (using the number of characters as the text length) to a 1200 sentence pair sample not taken from the training data. Table 3 shows the results. The first column in Table 3 represents the category, the second column is the total number of sentence pairs in this category, the third column is the number of sentence pairs that were misclassified and the fourth column is the

Table 3
The results using the length based approach

Category    Frequency   Error   % Error
1–1         1099        54      4.9%
1–0, 0–1    2           2       100%
1–2, 2–1    88          14      15.9%
2–2         2           1       50%
1–3, 3–1    9           6       66%
Total       1200        77      6.4%


percentage of this error. Although 1–0, 0–1 and 2–2 are rare cases, we have taken them into consideration to reduce error. Moreover, we did not consider other cases such as 3–3 or 3–4, since they are very rare, and considering more cases requires more computation and processing time. When the system encounters these cases, it misclassifies them.

5.2. Combined model using length, punctuation and cognate in a dynamic framework

We employed a sentence alignment algorithm which maximizes the probability over all possible alignments, given a pair of parallel texts, according to the following equation:

\[
\operatorname*{arg\,max}_{AL} P(AL \mid A, E), \qquad (12)
\]

where AL is the alignment and A and E are the Arabic and English texts. The binomial distribution describes a situation in which each of a number of trials necessarily results in one of two mutually exclusive outcomes. The statistical analysis of a hand-aligned portion of the Canadian Hansards revealed that the number of pairs of cognates that can be obtained from a pair of aligned segments of average size n (number of candidate tokens per segment) approximately follows a binomial distribution (Simard et al., 1992). Therefore, we can use the binomial distribution to approximate the value P(AL|A,E) as follows:

\[
P(AL \mid A, E) = \prod_{v=1}^{t} P(\mathrm{match}) \cdot P(\delta \mid \mathrm{match}) \cdot \binom{n_v}{r_v} \, P(A_{pv}, E_{pv})^{\,r_v} \, \bigl(1 - P(A_{pv}, E_{pv})\bigr)^{\,n_v - r_v}, \qquad (13)
\]

where:

t: the total number of sentences to be aligned;
n_v: the maximum number of punctuation marks plus cognates in either the English text or the Arabic text in the vth sentence to be aligned;
r_v: the number of compatible punctuation marks and cognates in ordered comparison;
P(A_pv, E_pv): the probability of the existence of a compatible punctuation mark and cognate in both languages;
P(match): the match type probability of aligning an English–Arabic text pair. P(match) is estimated using the 1000 English–Arabic sentence pairs that are manually aligned. Table 4 gives the possible values of P(match).

And P(δ|match) = 2(1 − P(|δ|)), where

\[
P(\delta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\delta} e^{-z^2/2} \, dz, \qquad \delta = \frac{l_e - l_a c}{\sqrt{l_a s^2}}
\]

(l_a, l_e, c and s^2 are as mentioned before). Finally, the model uses Eq. (13) in a dynamic programming framework to specify English–Arabic sentence pairs. Table 5 shows the results using the combined model.
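A per-pair factor of Eq. (13) might be scored in log space roughly as below. This is our own sketch: the values of c, s² and P(Ap, Epv) passed in are illustrative placeholders, not the parameters fitted in the paper.

```python
from math import comb, erf, log, sqrt

def normal_cdf(d):
    # P(delta) = (1/sqrt(2*pi)) * integral_{-inf}^{d} exp(-z^2/2) dz
    return 0.5 * (1.0 + erf(d / sqrt(2.0)))

def pair_score(le, la, n, r, p_match, p_ape, c=1.12, s2=0.0326):
    """Log of one factor of Eq. (13) for a candidate English/Arabic pair.
    le, la: character lengths; n: punctuation marks plus cognates in the
    longer text; r: number of compatible ones; p_match: match-type prior
    (Table 4); p_ape: P(Apv, Epv). The defaults for c and s2 are
    illustrative length-model parameters, not the fitted values."""
    delta = (le - la * c) / sqrt(la * s2)
    p_delta = 2.0 * (1.0 - normal_cdf(abs(delta)))        # P(delta | match)
    binom = comb(n, r) * p_ape**r * (1.0 - p_ape)**(n - r)  # cognate term
    return log(p_match) + log(p_delta) + log(binom)
```

A dynamic programming search would then maximize the sum of such log factors over candidate alignments, which is equivalent to maximizing the product in Eq. (13).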

5.3. Melamed’s geometric sentence alignment (GSA) approach

Melamed (1997, 1999) extended the smooth injective map recognizer (SIMR) to develop an algorithm called geometric sentence alignment (GSA). SIMR is based on a greedy algorithm for mapping bi-text correspondence. Each bi-text defines a rectangular bi-text space. The lower left corner of the rectangle is the origin of the bi-text space and represents the beginning of the two texts. The upper right corner is the terminus and represents the end of the two texts. The line between the origin and the terminus is the main diagonal. The width and height of the rectangle are the lengths of the two component

Table 4
The values of P(match) for each category

Category    Frequency   P(match)
1–1         915         0.915
1–0, 0–1    1           0.001
1–2, 2–1    75          0.075
2–2         1           0.001
1–3, 3–1    8           0.008


Table 5
The results using the combined model

Category    Frequency   Error   % Error
1–1         1099        45      4.1%
1–0, 0–1    2           2       100%
1–2, 2–1    88          11      12.5%
2–2         2           1       50.0%
1–3, 3–1    9           5       55.5%
Total       1200        64      5.3%


texts, in characters. Each bi-text space contains a number of true points of correspondence (TPCs), other than the origin and the terminus. Since distances in the bi-text space are measured in characters, the position of a token is defined as the mean position of its characters. TPCs exist at the corresponding boundaries of text units such as sentences. Groups of TPCs with a roughly linear arrangement in the bi-text space are called chains. For each bi-text, the true bi-text map (TBM) is the shortest bi-text map that runs through all the TPCs. SIMR considers only chains that are roughly parallel to the main diagonal. Arabic and English do not share the same alphabet; cognates do exist between them, but the correspondence points generated from cognates alone are not a strong enough signal for SIMR to achieve an accurate mapping. In this case another matching predicate, the matching of punctuation marks, is used to strengthen the signal by generating more correspondence points. Table 6 shows the results using Melamed’s approach.
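The requirement that chains lie roughly parallel and close to the main diagonal can be illustrated with a simple perpendicular-distance test. This is a geometric sketch only, not Melamed's actual chain criteria, and the displacement threshold is hypothetical.

```python
from math import hypot

def diagonal_displacement(x, y, width, height):
    """Perpendicular distance from point (x, y) in the bi-text space to the
    main diagonal joining the origin (0, 0) and the terminus (width, height).
    Positions are in characters, as in SIMR."""
    # Line through (0,0)-(W,H): H*x - W*y = 0; distance = |H*x - W*y| / |(W,H)|
    return abs(height * x - width * y) / hypot(width, height)

def roughly_on_diagonal(points, width, height, max_disp=100.0):
    """Keep candidate correspondence points whose displacement from the main
    diagonal is below a threshold (a hypothetical filter for illustration)."""
    return [p for p in points if diagonal_displacement(*p, width, height) <= max_disp]
```

For a 1000 × 1000 character bi-text space, a point at (500, 500) sits exactly on the diagonal, while a point at (100, 900) is several hundred characters away and would be discarded.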

5.4. Moore’s approach

Moore’s algorithm (Moore, 2002) combines techniques adapted from previous work on sentence and word alignment in a three-step process. He first aligned the corpus using a modified version of Brown et al.’s sentence-length-based model, employing a search-pruning technique to efficiently find the sentence pairs that align with highest probability without the use of anchor points or larger previously aligned units. Next, he used the sentence pairs assigned the highest probability of alignment to train a modified version of IBM Translation Model 1. Finally, he realigned the corpus, augmenting the initial alignment model with IBM Model 1, to produce an alignment based both on sentence length and word correspondences. The final search is confined to the minimal alignment segments that were assigned a non-negligible probability according to the initial alignment model, which reduces the size of the search space. Moore used the following sentence alignment categories: 1–1, 1–2, 2–1, 1–0, and 0–1. Table 7 shows the results using Moore’s approach.

Table 6
The results using Melamed’s approach

Category    Frequency   Error   % Error
1–1         1099        51      4.6%
1–0, 0–1    2           2       100%
1–2, 2–1    88          12      13.6%
2–2         2           1       50.0%
1–3, 3–1    9           5       55.5%
Total       1200        71      5.9%

Table 7
The results using Moore’s approach

Category    Frequency   Error   % Error
1–1         1099        37      3.4%
1–0, 0–1    2           2       100%
1–2, 2–1    88          7       7.9%
2–2         2           2       100%
1–3, 3–1    9           9       100%
Total       1200        57      4.7%


5.5. Probabilistic neural network approach

The system extracted features from 7653 manually aligned English–Arabic sentence pairs and used them to train a probabilistic neural network. 1200 English–Arabic sentence pairs were used as the testing data. These sentences were used as inputs to the probabilistic neural network after the feature extraction step. Alignment was done using a block of 3 sentences for each language. After aligning a source language sentence and target language sentence, the next 3 sentences were then looked at as follows:

(1) Extract features from the first three English sentences and do the same with the first three Arabic sentences.
(2) Construct the feature vector X.
(3) Use this feature vector as an input to the neural network.
(4) According to the network output, construct the second feature vector. For instance, if the result of the network is S1a → 0, then read the fourth Arabic sentence and use it with the second and third Arabic sentences and the first three English sentences to generate the feature vector X.
(5) Continue using this approach until there are no more English–Arabic text pairs.
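Steps (1)–(5) can be sketched as a sliding-block loop with a pluggable classifier. Here `extract_features` is reduced to sentence lengths and `classify` is a stand-in for the trained P-NNT, which in the paper also consumes punctuation and cognate scores.

```python
def extract_features(en_block, ar_block):
    """Toy feature vector for a 3-sentence English block and a 3-sentence
    Arabic block: character lengths only (the paper's vector also carries
    punctuation and cognate scores)."""
    return [len(s) for s in en_block] + [len(s) for s in ar_block]

def align(english, arabic, classify):
    """Walk both texts with 3-sentence blocks. `classify` maps a feature
    vector to how many sentences each side of the predicted category
    consumes, e.g. (1, 1) for a 1-1 bead or (1, 2) for 1-2; it stands in
    for the P-NNT output."""
    beads, e, a = [], 0, 0
    while e < len(english) and a < len(arabic):
        x = extract_features(english[e:e + 3], arabic[a:a + 3])
        ne, na = classify(x)
        beads.append((english[e:e + ne], arabic[a:a + na]))
        e += ne   # advance past the consumed English sentences
        a += na   # advance past the consumed Arabic sentences
    return beads
```

Advancing each side by the number of sentences the predicted category consumes is what realizes step (4): after an S1a → 0 decision, the Arabic window slides by one while the English window stays put.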

Table 8 shows the results when we applied this approach to the 1200 English–Arabic sentence pairs. It is clear from Table 8 that the results improved in terms of accuracy over the first three approaches. Additionally, we applied the P-NNT approach to the entire English–Arabic corpus containing 191,623 English sentences and 183,542 Arabic sentences. We then randomly selected 500 sentence pairs from the sentence aligned output file and manually checked them. The system reported a total error rate of 4.4%.

We decreased the number of sentence pairs used for training the P-NNT to 4000 sentence pairs to investigate the effect of the training data size on the total system performance. These 4000 sentence pairs were randomly selected from the training data. The constructed model was then used to align the entire English–Arabic corpus. We then randomly selected 500 sentence pairs from the sentence aligned output file and manually checked them. The system reported a total error rate of 4.9%. The reduction of the training data set does not significantly change total system performance.

5.6. Gaussian mixture model approach

The system extracted features from 7653 manually aligned English–Arabic sentence pairs and used them to construct a Gaussian mixture model for each category, 8 in total. 1200 English–Arabic sentence pairs were used as the testing data. Using formula (11) and the 5 steps mentioned in the previous section (using GMM instead of P-NNT), the 1200 English–Arabic sentence pairs were aligned. Table 9 shows the results when we applied this approach. Additionally, we applied the GMM approach to the entire English–Arabic corpus. We then randomly selected 500 sentence pairs from the sentence aligned output and manually checked them. The system reported a total error rate of 3.4%.
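Category selection by maximum GMM likelihood might look as follows. The diagonal-covariance evaluation and the toy one-dimensional models are our own simplification; in the paper the mixtures are trained on the full feature vectors of the 7653 pairs.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM:
    log sum_k w_k * N(x; mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)
    log_comp = []
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        # log of a diagonal-covariance Gaussian density
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        log_comp.append(np.log(w) + ll)
    m = max(log_comp)  # log-sum-exp for numerical stability
    return m + np.log(sum(np.exp(c - m) for c in log_comp))

def classify(x, category_models):
    """Pick the alignment category whose GMM gives x the highest likelihood.
    `category_models` maps a category name to (weights, means, variances)."""
    return max(category_models, key=lambda c: gmm_loglik(x, *category_models[c]))
```

In practice the per-category weights, means and variances would be fitted with EM on the training feature vectors; the argmax over the 8 category models replaces the P-NNT output in step (4).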

We decreased the number of sentence pairs used for training the GMM to 4000 sentence pairs, as in the previous section, to investigate the effect of the training data size on total system performance. These 4000 sentence pairs were randomly selected from the training data. The constructed model was then used to align the

Table 8
The results using the probabilistic neural network

Category    Frequency   Error   % Error
1–1         1099        40      3.6%
1–0, 0–1    2           2       100%
1–2, 2–1    88          9       10.2%
2–2         2           1       50%
1–3, 3–1    9           5       55.5%
Total       1200        57      4.7%


Table 9
The results using the Gaussian mixture model

Category    Frequency   Error   % Error
1–1         1099        28      2.5%
1–0, 0–1    2           2       100%
1–2, 2–1    88          5       5.6%
2–2         2           1       50%
1–3, 3–1    9           3       33.3%
Total       1200        39      3.2%

[Bar chart: total % error (0–7%) for the length based, combined model, Melamed’s, Moore’s, P-NNT and GMM approaches.]

Fig. 2. The total system performance for all approaches.


entire English–Arabic corpus. We then randomly selected 500 sentence pairs from the sentence aligned output and manually checked them. The system reported a total error rate of 3.3%. The reduction of the training data set from 7653 to 4000 does not change the total system performance.

5.7. Discussion

The length based approach did not give bad results, as shown in Table 3 (the total error was 6.4%). This is explained by the good correlation (0.992) and low standard deviation (0.1806) between English and Arabic text lengths. The combined model used not only the text length, but also the punctuation and cognate matches, in a dynamic framework. Because of this, the result of the combined model was better than that of the length based approach. Melamed’s approach did not give good results, since the text features used might not be sufficient for this approach. Moore’s approach gave good results. However, it has a drawback in that it does not support the alignment categories 2–2, 1–3, and 3–1, which exist in our corpus. The probabilistic neural network approach decreased the total error by 27% and the Gaussian mixture model approach decreased it by 50% compared to the length based approach. Moreover, they gave better results than the combined model, as shown in Fig. 2. The Gaussian mixture model approach also outperforms Melamed’s and Moore’s approaches. The probabilistic neural network and Gaussian mixture model approaches are quite flexible, and they open the door for many other models to be used for the sentence alignment problem.

Using feature extraction criteria allows researchers to use different feature values when trying to solve natural language processing problems in general.

6. Conclusions

In this paper, we investigated the use of a probabilistic neural network and a Gaussian mixture model for sentence alignment. We applied our new approaches to a sample of an English–Arabic parallel corpus. Our results outperform the length based approach, the combined model, and Melamed’s and Moore’s approaches.


The proposed approaches have improved total system performance in terms of effectiveness (accuracy). These approaches decreased the total error by 27% and 50% using the probabilistic neural network and the Gaussian mixture model, respectively. Our approaches use feature extraction criteria, which gives researchers an opportunity to use many varieties of these features based on the language pairs used and their text types (Hanzi characters in Japanese–Chinese texts may be used, for instance).

Acknowledgements

This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (B) 14380166 and 17300065, Exploratory Research 17656128 in 2005, and the International Communications Foundation (ICF).

References

Brown, P., Lai, J., Mercer, R., 1991. Aligning sentences in parallel corpora. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA, USA.

Cain, B.J., 1990. Improved probabilistic neural networks and its performance relative to the other models. Proceedings of SPIE, Applications of Artificial Neural Networks 1294, 354–365.

Chen, S.F., 1993. Aligning sentences in bilingual corpora using lexical information. In: Proceedings of ACL-93, Columbus, OH, pp. 9–16.

Chen, K.H., Chen, H.H., 1994. A part-of-speech-based alignment algorithm. In: Proceedings of the 15th International Conference on Computational Linguistics, Kyoto, pp. 166–171.

Chen, A., Gey, F., 2001. Translation term weighting and combining translation resources in cross-language retrieval. TREC, 2001.

Christopher, C., Kar, Li., 2004. Building parallel corpora by automatic title alignment using length-based and text-based approaches. Information Processing and Management 40, 939–955.

Collier, N., Ono, K., Hirakawa, H., 1998. An experiment in hybrid dictionary and statistical sentence alignment. COLING-ACL, 268–274.

Danielsson, P., Muhlenbock, K., 2000. The misconception of high-frequency words in Scandinavian translation. AMTA, 158–168.

Davis, M., Ren, F., 1998. Automatic Japanese–Chinese parallel text alignment. Proceedings of International Conference on Chinese Information Processing, 452–457.

Dejean, H., Gaussier, E., Sadat, F., 2002. Bilingual terminology extraction: an approach based on a multilingual thesaurus applicable to comparable corpora. In: Proceedings of the 19th International Conference on Computational Linguistics COLING 2002, Taipei, Taiwan, pp. 218–224.

Dolan, W.B., Pinkham, J., Richardson, S.D., 2002. MSR-MT, the Microsoft research machine translation system. AMTA, 237–239.

Fattah, M., Ren, F., Kuroiwa, S., 2006a. Stemming to improve translation lexicon creation from bitexts. Information Processing and Management 42 (4), 1003–1016.

Fattah, M., Ren, F., Kuroiwa, S., 2006b. Speaker recognition for wire/wireless communication systems. The International Arab Journal of Information Technology ‘‘IAJIT’’ 3 (2), 26–32.

Gale, W.A., Church, K.W., 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics 19, 75–102.

Ganchev, T., Tasoulis, D.K., Vrahatis, M.N., Fakotakis, N., 2003. Locally recurrent probabilistic neural networks for text independent speaker verification. Proceedings of the EuroSpeech 3, 1673–1676.

Gey, F.C., Chen, A., Buckland, M.K., Larson, R.R., 2002. Translingual vocabulary mappings for multilingual information access. SIGIR, 455–456.

Ker, S.J., Chang, J.S., 1997. A class-based approach to word alignment. Computational Linguistics 23 (2), 313–344.

Kraaij, W., Nie, J.Y., Simard, M., 2003. Embedding web-based statistical translation models in cross-language information retrieval. Computational Linguistics 29 (3), 381–419.

Melamed, I.D., 1997. A portable algorithm for mapping bitext correspondence. In: The 35th Conference of the Association for Computational Linguistics (ACL 1997), Madrid, Spain.

Melamed, I.D., 1999. Bitext maps and alignment via pattern recognition. Computational Linguistics 25 (1), 107–130.

Moore, R.C., 2002. Fast and accurate sentence alignment of bilingual corpora. AMTA, 135–144.

Oard, D.W., 1997. Alternative approaches for cross-language text retrieval. In: Hull, D., Oard, D. (Eds.), AAAI Symposium in Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March.

Pellom, B.L., Hansen, J.H.L., 1998. An efficient scoring algorithm for Gaussian mixture model based speaker identification. IEEE Signal Processing Letters 5 (11), 281–284.

Resnik, P., Smith, N.A., 2003. The web as a parallel corpus. Computational Linguistics 29 (3), 349–380.

Reynolds, D., 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication 17, 91–108.

Ribeiro, A., Dias, G., Lopes, G., Mexia, J., 2001. Cognates alignment. In: Bente Maegaard (Ed.), Proceedings of the Machine Translation Summit VIII (MT Summit VIII), Machine Translation in the Information Age, Santiago de Compostela, Spain, pp. 287–292.

Simard, M., 1999. Text-translation alignment: three languages are better than two. In: Proceedings of EMNLP/VLC-99, College Park, MD.

Simard, M., Foster, G., Isabelle, P., 1992. Using cognates to align sentences in bilingual corpora. In: Proceedings of TMI92, Montreal, Canada, pp. 67–81.

Specht, D.F., 1990. Probabilistic neural networks. Neural Networks 3 (1), 109–118.

Thomas, C., Kevin, C., 2005. Aligning parallel bilingual corpora statistically with punctuation criteria. Computational Linguistics and Chinese Language Processing 10 (1), 95–122.