spsim

Measuring Spelling Similarity for Cognate Identification

Luís Gomes

Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa

EPIA 2011, October 10, 2011, Lisboa

What are cognates?

In linguistics, cognates are words that have a common etymological origin. – Wikipedia

Example

The words etymology (English) and etimologia (Portuguese) both derive from Greek etymología through Latin etymologia.


What are cognates?

I am particularly interested in cognates across different languages that retain the same meaning, such as:

German      symbole   themen  operative
English     symbols   themes  operational
French      symboles  thèmes  opérationnelle
Spanish     símbolos  temas   operativa
Portuguese  símbolos  temas   operacional
Italian     simboli   temi    operativa


German      demokratische  aspekte   justiz
English     democratic     aspects   justice
French      démocratique   aspects   justice
Spanish     democrática    aspectos  justicia
Portuguese  democrática    aspectos  justiça
Italian     democratica    aspetti   giustizia


Extracting cognates from parallel corpora

Example parallel sentences

English: The Member States shall coordinate their economic policies within the Union .

Portuguese: Os Estados - Membros coordenam as suas políticas económicas no âmbito da União .

Spelling similarity

Cognates typically have similar spellings.

Association

Translations tend to co-occur systematically in parallel texts, while non-translations co-occur by chance.



Spelling similarity measures

- EDSim (Edit-Distance-based Similarity)
- LCSR (Longest Common Subsequence Ratio)
- and a few others . . .

Association measures

- Dice
- SCP (Symmetric Conditional Probability)
- Mutual Information
- and many others . . .



Most commonly used spelling similarity measures

\[ \mathrm{EDSim}(w_1, w_2) = 1 - \frac{\mathrm{ED}(w_1, w_2)}{\max(|w_1|, |w_2|)} \]

ED(w1,w2) is the Edit Distance between words w1 and w2, and |w| is the length of word w.

\[ \mathrm{LCSR}(w_1, w_2) = \frac{\mathrm{LCS}(w_1, w_2)}{\max(|w_1|, |w_2|)} \]

LCS(w1,w2) is the length of the Longest Common Subsequence between words w1 and w2.
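The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: it assumes unit-cost Levenshtein distance for ED (the slides do not state the edit costs), and the function names and the convention that two empty strings score 1.0 are my own choices.

```python
def edit_distance(w1: str, w2: str) -> int:
    """Levenshtein distance, assuming unit cost for insert, delete, substitute."""
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, start=1):
        curr = [i]
        for j, c2 in enumerate(w2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(w1: str, w2: str) -> int:
    """Length of the longest common subsequence of w1 and w2."""
    prev = [0] * (len(w2) + 1)
    for c1 in w1:
        curr = [0]
        for j, c2 in enumerate(w2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edsim(w1: str, w2: str) -> float:
    longest = max(len(w1), len(w2))
    # Assumption: two empty strings are treated as identical.
    return 1.0 if longest == 0 else 1.0 - edit_distance(w1, w2) / longest


def lcsr(w1: str, w2: str) -> float:
    longest = max(len(w1), len(w2))
    return 1.0 if longest == 0 else lcs_length(w1, w2) / longest
```

Both functions fill a table of size |w1| x |w2| row by row, keeping only the previous row in memory.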



Problem with these measures

They look at strings too literally!

EDSim(photographic, fotografica) = 0.5

LCSR(photographic, fotografica) = 0.58

The spelling similarity score should be closer to 1.0 to reflect human judgement.


How does SpSim work?

First we align the two strings to find differences

This takes O(|w1|·|w2|) time, just like computing ED(w1,w2) or LCS(w1,w2).

ˆ ph o t o g r aph i c $

ˆ f o t o g r af i c o $

Then we check which differences we may ignore

- Is “ph f” in the hashtable?
- Is “aph af” in the hashtable?
- Is “ o” in the hashtable?

In learning mode we would insert these differences into the hashtable instead.
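The align-then-look-up step can be sketched as follows. The slides do not show the alignment procedure itself, so this sketch swaps in Python's `difflib.SequenceMatcher` as a stand-in aligner; it may segment differences differently from the slide (for instance, it reports “ph f” where the slide shows “aph af”). The function names are hypothetical, the boundary markers are written `^` and `$`, and a Python `set` plays the role of the hashtable.

```python
from difflib import SequenceMatcher


def differences(w1: str, w2: str) -> list[tuple[str, str]]:
    """Return the unaligned spans as (source, target) pairs.

    Assumption: difflib's matcher is used as a stand-in for the
    alignment on the slide; it is not the author's procedure.
    """
    a, b = "^" + w1 + "$", "^" + w2 + "$"
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def check_or_learn(w1: str, w2: str, table: set, learning: bool):
    """In learning mode, store every difference in the table.
    Otherwise, return the differences that are not in the table,
    i.e. the ones that cannot be ignored."""
    remaining = []
    for diff in differences(w1, w2):
        if learning:
            table.add(diff)
        elif diff not in table:
            remaining.append(diff)
    return remaining
```

A possible usage: call `check_or_learn("photographic", "fotografico", table, learning=True)` once to populate the table, then call it with `learning=False` on a new pair such as `("phase", "fase")` to see which differences remain.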



Finally, we compute SpSim

\[ \mathrm{SpSim}(w_1, w_2) = 1 - \frac{\sum_i D_i}{|w_1| + |w_2|} \]

D_i is the length of each difference that cannot be ignored.

If no difference is ignored, then SpSim(w1,w2) = EDSim(w1,w2).
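As a rough illustration of the scoring step, the sketch below applies the formula to a list of differences that has already been extracted. The slides do not say how the length of a difference is measured, so the default used here (characters on both sides of the difference) is purely an assumption, exposed through the hypothetical `diff_length` parameter so that another convention can be plugged in.

```python
def spsim_score(w1, w2, diffs, ignorable, diff_length=None):
    """Score a word pair from its differences.

    diffs      -- list of (source, target) spans where the words differ
    ignorable  -- set of (source, target) spans that may be ignored
    diff_length -- hypothetical hook: how to measure one difference.
                   Assumed default: characters on both sides.
    """
    if diff_length is None:
        diff_length = lambda src, tgt: len(src) + len(tgt)
    total = len(w1) + len(w2)
    if total == 0:
        return 1.0  # assumption: two empty words are identical
    penalty = sum(diff_length(src, tgt)
                  for src, tgt in diffs
                  if (src, tgt) not in ignorable)
    return 1.0 - penalty / total


# Differences for phase / fase, written out by hand.
diffs = [("ph", "f")]
print(spsim_score("phase", "fase", diffs, ignorable=set()))
print(spsim_score("phase", "fase", diffs, ignorable={("ph", "f")}))
```

With an empty table every difference is penalized; once “ph f” is known, the pair scores 1.0 under these assumptions.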



Problem: over-generalization

Some differences, such as "insert an “o” in the Portuguese word", are too vague and may occur by chance (i.e., even if the words are totally unrelated).

ˆ ph o t o g r aph i c $

ˆ f o t o g r af i c o $



Solution: contextualize first and generalize afterwards

Contextualized differences are less likely to occur by chance.

Example: insert an “o” at the end of the Portuguese word if the English word ends with a “c”.

ˆpho t o g raphi c$

ˆfo t o g rafi co$

Whenever we find the same difference in a different context we may generalize it.

ˆpha s e $

ˆfa s e $

“ˆpho ˆfo” + “ˆpha ˆfa” = “ˆph ˆf”
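One way to picture the generalization step is sketched below. The representation is hypothetical: each contextualized difference is stored as a left context, the differing spans, and a right context, and two observations of the same difference keep only the context on which they agree. The slides do not describe the actual data structure, so treat this as an illustration of the idea only (the start marker is written `^`).

```python
from typing import NamedTuple, Optional


class ContextDiff(NamedTuple):
    left: str   # context before the difference ("" means any context)
    src: str    # span in the first word
    tgt: str    # span in the second word
    right: str  # context after the difference ("" means any context)


def generalize(d1: ContextDiff, d2: ContextDiff) -> Optional[ContextDiff]:
    """Merge two observations of the same difference, dropping any
    context on which they disagree. Returns None if the differences
    themselves are not the same."""
    if (d1.src, d1.tgt) != (d2.src, d2.tgt):
        return None
    left = d1.left if d1.left == d2.left else ""
    right = d1.right if d1.right == d2.right else ""
    return ContextDiff(left, d1.src, d1.tgt, right)


def render(d: ContextDiff) -> str:
    """Write a difference in the slide's "source target" notation."""
    return f"{d.left}{d.src}{d.right} {d.left}{d.tgt}{d.right}"


pho = ContextDiff("^", "ph", "f", "o")  # from photographic / fotografico
pha = ContextDiff("^", "ph", "f", "a")  # from phase / fase
print(render(generalize(pho, pha)))     # prints: ^ph ^f
```

The left context (word start) is kept because both observations share it, while the right context is dropped because “o” and “a” disagree.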


Experimental setup

Corpora

I used a parallel corpus of texts from the European Constitution in five language pairs.

Method

1. Obtain a list of putative cognates by thresholding an association measure (Dice).

2. Manually verify all putative cognates.

3. Compare the precision, recall and f-measure of SpSim and EDSim for a series of different threshold values.
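Step 3 can be sketched as a small evaluation loop. This is an illustration under stated assumptions: a pair counts as a predicted cognate when its score is greater than or equal to the threshold, precision and recall default to 0.0 when undefined, and the toy scores and labels below are made up for the example (they are not results from the experiments).

```python
def evaluate(scored_pairs, threshold):
    """Return (precision, recall, f_measure) at one threshold.

    scored_pairs -- iterable of (score, is_cognate), where score comes
                    from a similarity measure and is_cognate from the
                    manual verification step.
    """
    tp = fp = fn = 0
    for score, is_cognate in scored_pairs:
        predicted = score >= threshold  # assumed decision rule
        if predicted and is_cognate:
            tp += 1
        elif predicted:
            fp += 1
        elif is_cognate:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Made-up toy data: (similarity score, manually verified label).
toy = [(0.9, True), (0.8, False), (0.4, True), (0.2, False)]
for t in (0.0, 0.5, 1.0):
    print(t, evaluate(toy, t))
```

Running the same loop once with EDSim scores and once with SpSim scores gives the two sets of curves that are compared.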


Experimental setup

I used Dice to extract the initial list of putative cognates

\[ \mathrm{Dice}(x, y) = \frac{2\,F(x, y)}{F(x) + F(y)} \]

F(x, y) is the number of co-occurrences in all parallel segments s_i:

\[ F(x, y) = \sum_i \min\bigl(f(x, s_i),\, f(y, s_i)\bigr) \]

F(x) and F(y) are the total number of occurrences of x and y in all parallel segments s_i:

\[ F(x) = \sum_i f(x, s_i) \; ; \qquad F(y) = \sum_i f(y, s_i) \]
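The Dice computation above can be sketched as follows, reading f(x, s_i) as the number of times word x occurs in segment s_i (the slides imply this but do not state it). The function name, the `min_dice` parameter and the toy segments are hypothetical; a real corpus would need tokenization and a more economical way of enumerating candidate pairs.

```python
from collections import Counter


def dice_table(segments, min_dice=0.0):
    """Compute Dice for every source/target word pair that co-occurs.

    segments -- list of (source_tokens, target_tokens) parallel segments
    min_dice -- keep only pairs whose Dice score is at least this value
    """
    f_x, f_y, f_xy = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in segments:
        c_x, c_y = Counter(src_tokens), Counter(tgt_tokens)
        f_x.update(c_x)  # F(x): occurrences over all segments
        f_y.update(c_y)  # F(y)
        for x, n_x in c_x.items():
            for y, n_y in c_y.items():
                f_xy[(x, y)] += min(n_x, n_y)  # F(x, y)
    scores = {}
    for (x, y), n in f_xy.items():
        dice = 2 * n / (f_x[x] + f_y[y])
        if dice >= min_dice:
            scores[(x, y)] = dice
    return scores


# Toy segments, made up for illustration.
segments = [
    ("the union".split(), "a união".split()),
    ("the states".split(), "os estados".split()),
]
for pair, score in sorted(dice_table(segments, min_dice=0.6).items()):
    print(pair, round(score, 2))
```

Passing `min_dice=0.6` mirrors the kind of cut-off used to build a list of putative cognates.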


Experimental setup

Extracted all pairs of words with Dice ≥ 0.6.

Summary of extraction and manual verification

Language Pair       Accepted  Rejected  Total
German-English           269       878   1147
English-Spanish          399       749   1148
English-French           380       825   1205
English-Portuguese       410       796   1206
French-Italian           635       974   1609


Comparing SpSim to EDSim

[Figure: English–Portuguese. Precision, recall and f-measure of EDSim and SpSim plotted against the threshold (both axes from 0 to 1); SpSim learned from 16 examples (edsim > 0.9).]

[Figure: English–German. F-measure of EDSim and SpSim plotted against the threshold (both axes from 0 to 1).]


Comparing SpSim to EDSim

- EN-ES: 18 examples (edsim > 0.9)
- EN-FR: 31 examples (edsim > 0.9)
- EN-PT: 16 examples (edsim > 0.9)
- DE-EN: 4 examples (edsim > 0.9)
- FR-IT: 14 examples (edsim > 0.9)


Conclusions

- SpSim learns fast
- SpSim has much better recall than EDSim (and LCSR)
- SpSim has the same time complexity as EDSim and LCSR
- SpSim is almost as easy to implement as EDSim or LCSR


Thanks for listening

Questions?

