spsim
DESCRIPTION
This talk presents SpSim, a new string similarity measure for identifying cognates. It tolerates characteristic spelling differences, which are extracted automatically from a set of cognates known a priori.
Talk given at EPIA 2011, October 10, 2011, Lisboa.
TRANSCRIPT
Measuring Spelling Similarity for Cognate Identification
Luís Gomes
Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa
EPIA 2011, October 10, 2011, Lisboa
What are cognates?
In linguistics, cognates are words that have a common etymological origin. – Wikipedia
Example
The words etymology (English) and etimologia (Portuguese) both derive from Greek etymología through Latin etymologia.
EPIA 2011 Measuring Spelling Similarity for Cognate Identification Luís Gomes
What are cognates?
I am particularly interested in cognates across different languages that retain the same meaning, such as:
German      symbole   themen  operative
English     symbols   themes  operational
French      symboles  thèmes  opérationnelle
Spanish     símbolos  temas   operativa
Portuguese  símbolos  temas   operacional
Italian     simboli   temi    operativa
German      demokratische  aspekte   justiz
English     democratic     aspects   justice
French      démocratique   aspects   justice
Spanish     democrática    aspectos  justicia
Portuguese  democrática    aspectos  justiça
Italian     democratica    aspetti   giustizia
Extracting cognates from parallel corpora
Example parallel sentences
The Member States shall coordinate their economic policies within the Union .
Os Estados - Membros coordenam as suas políticas económicas no âmbito da União .
Spelling similarity
Cognates typically have similar spellings.
Association
Translations tend to co-occur systematically in parallel texts, while non-translations co-occur only by chance.
Extracting cognates from parallel corpora
Spelling similarity measures
- EDSim (Edit-Distance-based Similarity)
- LCSR (Longest Common Subsequence Ratio)
- and a few others ...
Association measures
- Dice
- SCP (Symmetric Conditional Probability)
- Mutual Information
- and many others ...
Extracting cognates from parallel corpora
Most commonly used spelling similarity measures
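\[ \mathrm{EDSim}(w_1, w_2) = 1 - \frac{\mathrm{ED}(w_1, w_2)}{\max(|w_1|, |w_2|)} \]
$\mathrm{ED}(w_1, w_2)$ is the edit distance between words $w_1$ and $w_2$, and $|w|$ is the length of word $w$.
\[ \mathrm{LCSR}(w_1, w_2) = \frac{\mathrm{LCS}(w_1, w_2)}{\max(|w_1|, |w_2|)} \]
$\mathrm{LCS}(w_1, w_2)$ is the length of the longest common subsequence of words $w_1$ and $w_2$.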
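The two measures can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names are hypothetical, the edit distance is assumed to be the Levenshtein distance with unit costs, and both scores are normalised by the length of the longer word. The Portuguese word is assumed to carry its accent (fotográfica).

```python
def edit_distance(w1: str, w2: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, start=1):
        curr = [i]
        for j, c2 in enumerate(w2, start=1):
            curr.append(min(prev[j] + 1,                 # delete c1
                            curr[j - 1] + 1,             # insert c2
                            prev[j - 1] + (c1 != c2)))   # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(w1: str, w2: str) -> int:
    """Length of the longest common subsequence of w1 and w2."""
    prev = [0] * (len(w2) + 1)
    for c1 in w1:
        curr = [0]
        for j, c2 in enumerate(w2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edsim(w1: str, w2: str) -> float:
    longest = max(len(w1), len(w2))
    return 1.0 - edit_distance(w1, w2) / longest if longest else 1.0


def lcsr(w1: str, w2: str) -> float:
    longest = max(len(w1), len(w2))
    return lcs_length(w1, w2) / longest if longest else 1.0


# "fotográfica", written with an escape so the accent is a single code point
pt = "fotogr\u00e1fica"
print(round(edsim("photographic", pt), 2))  # 0.5
print(round(lcsr("photographic", pt), 2))   # 0.58
```

Both functions fill a table of size |w1| × |w2| one row at a time, so they run in time proportional to the product of the two word lengths.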
Extracting cognates from parallel corpora
Problem with these measures
They look at strings too literally!
EDSim(photographic, fotográfica) = 0.5
LCSR(photographic, fotográfica) = 0.58
The spelling similarity score should be closer to 1.0 to reflect human judgement.
How does SpSim work?
First we align the two strings to find differences
This takes O(|w1| · |w2|) time, just like computing ED(w1,w2) or LCS(w1,w2).
^ ph o t o g r aph i c   $
^ f  o t o g r áf  i c o $
Then we check which differences we may ignore
- Is “ph f” in the hash table?
- Is “aph áf” in the hash table?
- Is “ o” (insert an “o”) in the hash table?
In learning mode we would insert these differences into the hash table instead.
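The alignment and lookup steps could be sketched as follows. This is an illustration under stated assumptions, not the implementation from the talk: the alignment is taken to be a Levenshtein alignment, adjacent non-matching columns are merged into one difference, a Python set stands in for the hash table, and the names `find_differences`, `learn` and `unknown_differences` are hypothetical. The Portuguese word is assumed to carry its accent (fotográfico), which is why the second difference spans “aph”.

```python
def find_differences(w1: str, w2: str) -> list:
    """Align w1 and w2 and return the (segment of w1, segment of w2) pairs that differ."""
    m, n = len(w1), len(w2)
    # dist[i][j] = edit distance between w1[:i] and w2[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if w1[i - 1] == w2[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    diffs, seg1, seg2 = [], [], []

    def flush():
        # Close the current run of non-matching columns, if there is one.
        if seg1 or seg2:
            diffs.append(("".join(reversed(seg1)), "".join(reversed(seg2))))
            seg1.clear()
            seg2.clear()

    # Walk back from the end of both words along an optimal alignment.
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and w1[i - 1] == w2[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            flush()                      # matching characters end a difference
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            seg1.append(w1[i - 1])       # substitution
            seg2.append(w2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            seg1.append(w1[i - 1])       # character only in w1
            i -= 1
        else:
            seg2.append(w2[j - 1])       # character only in w2
            j -= 1
    flush()
    diffs.reverse()
    return diffs


def learn(w1: str, w2: str, known: set) -> None:
    """Learning mode: store the differences of a known cognate pair."""
    known.update(find_differences(w1, w2))


def unknown_differences(w1: str, w2: str, known: set) -> list:
    """Scoring mode: keep only the differences that may not be ignored."""
    return [d for d in find_differences(w1, w2) if d not in known]


known = set()
learn("phase", "fase", known)
# "fotográfico", written with an escape so the accent is a single code point
print(unknown_differences("photographic", "fotogr\u00e1fico", known))
```

After learning from phase/fase, the initial “ph f” is ignored, and the differences “aph áf” and “ o” remain.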
How does SpSim work?
Finally, we compute SpSim
\[ \mathrm{SpSim}(w_1, w_2) = 1 - \frac{\sum_i D_i}{|w_1| + |w_2|} \]
$D_i$ is the length of each difference that cannot be ignored.
If no difference is ignored, then $\mathrm{SpSim}(w_1, w_2) = \mathrm{EDSim}(w_1, w_2)$.
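A minimal sketch of the scoring step, under two assumptions that the slide does not spell out: the length of a difference is taken to be the longer of its two sides, and the score is normalised by the length of the longer word, as in EDSim, so that it reduces to EDSim when nothing is ignored. The formula on the slide reads w1 + w2 in the denominator, so the normaliser here is an assumption rather than the author's definition. The function name `spsim_score` is hypothetical, and the differences are passed in already extracted.

```python
def spsim_score(w1: str, w2: str, differences, known) -> float:
    """Score a word pair from its differences.

    differences: (segment of w1, segment of w2) pairs found by aligning w1 and w2
    known:       set of differences that may be ignored
    """
    # Assumption: a difference is as long as the longer of its two sides.
    penalty = sum(max(len(a), len(b)) for a, b in differences if (a, b) not in known)
    # Assumption: normalise by the longer word, as EDSim does.
    longest = max(len(w1), len(w2))
    return 1.0 - penalty / longest if longest else 1.0


# "fotográfico", written with escapes so the accent is a single code point
w1, w2 = "photographic", "fotogr\u00e1fico"
diffs = [("ph", "f"), ("aph", "\u00e1f"), ("", "o")]

print(spsim_score(w1, w2, diffs, set()))                                  # 0.5
print(round(spsim_score(w1, w2, diffs, {("ph", "f"), ("aph", "\u00e1f")}), 2))  # 0.92
```

With an empty set of known differences the penalty is 2 + 3 + 1 = 6 over a longest word of 12 characters; once the two “ph” differences are known, only the final “o” is penalised.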
How does SpSim work?
Problem: over-generalization
Some differences, such as “insert an o in the Portuguese word”, are too vague and may occur by chance (i.e., even if the words are totally unrelated).
^ ph o t o g r aph i c   $
^ f  o t o g r áf  i c o $
How does SpSim work?
Solution: contextualize first and generalize afterwards
Contextualized differences are less likely to occur by chance.
Example: insert an “o” at the end of the Portuguese word if the English word ends with a “c”.
^pho t o g raphi c$
^fo  t o g ráfi  co$
Whenever we find the same difference in a different context we may generalize it.
^pha s e $
^fa  s e $
“^pho ^fo” + “^pha ^fa” = “^ph ^f”
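The generalization step can be sketched as below. This is one possible reading of the example above, not the talk's implementation: it assumes one character of context on each side of a difference, and it assumes that generalizing means dropping a context side whenever two observations of the same difference disagree on it. The type `ContextDiff` and the function `generalize` are hypothetical names.

```python
from typing import NamedTuple, Optional


class ContextDiff(NamedTuple):
    left: str    # context character before the difference ("^" marks the word start)
    source: str  # differing segment in the first word
    target: str  # differing segment in the second word
    right: str   # context character after the difference ("$" marks the word end)

    def key(self) -> str:
        """Render the difference as on the slide, e.g. '^pho ^fo'."""
        return (f"{self.left}{self.source}{self.right} "
                f"{self.left}{self.target}{self.right}")


def generalize(a: ContextDiff, b: ContextDiff) -> Optional[ContextDiff]:
    """Merge two observations of the same difference seen in different contexts."""
    if (a.source, a.target) != (b.source, b.target):
        return None  # not the same difference, nothing to generalize
    left = a.left if a.left == b.left else ""
    right = a.right if a.right == b.right else ""
    return ContextDiff(left, a.source, a.target, right)


photo = ContextDiff("^", "ph", "f", "o")   # from photographic / fotográfico
phase = ContextDiff("^", "ph", "f", "a")   # from phase / fase
print(generalize(photo, phase).key())      # ^ph ^f
```

The word-initial marker survives because both observations share it, while the following vowel is dropped because it differs.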
Experimental setup
Corpora
I used a parallel corpus of texts from the European Constitution in five language pairs.
Method
1. Obtain a list of putative cognates by thresholding an association measure (Dice).
2. Manually verify all putative cognates.
3. Compare the precision, recall and f-measure of SpSim and EDSim for a series of different threshold values.
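Step 3 can be sketched as follows. The sketch assumes that a pair is accepted when its similarity score is at or above the threshold, and that the manual verification gives a true/false label per pair; the function name and the toy scores are made up for illustration and are not data from the experiments.

```python
def precision_recall_f(scored_pairs, threshold):
    """scored_pairs: (similarity score, manually verified as cognate?) tuples."""
    accepted = [is_cognate for score, is_cognate in scored_pairs if score >= threshold]
    true_positives = sum(accepted)
    total_cognates = sum(is_cognate for _, is_cognate in scored_pairs)

    precision = true_positives / len(accepted) if accepted else 0.0
    recall = true_positives / total_cognates if total_cognates else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy data, invented for this example only.
toy = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]
for t in (0.85, 0.5):
    p, r, f = precision_recall_f(toy, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Sweeping the threshold from 0 to 1 and plotting the three values for each measure gives curves that can be compared between SpSim and EDSim.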
Experimental setup
I used Dice to extract the initial list of putative cognates
\[ \mathrm{Dice}(x, y) = \frac{2\,F(x, y)}{F(x) + F(y)} \]
$F(x, y)$ is the number of co-occurrences of $x$ and $y$ in all parallel segments $s_i$:
\[ F(x, y) = \sum_i \min\bigl(f(x, s_i), f(y, s_i)\bigr) \]
$F(x)$ and $F(y)$ are the total numbers of occurrences of $x$ and $y$ in all parallel segments $s_i$:
\[ F(x) = \sum_i f(x, s_i) \,; \quad F(y) = \sum_i f(y, s_i) \]
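The Dice computation can be sketched directly from these definitions. The sketch assumes that f(x, s_i) is the number of times word x occurs in its side of segment s_i, and that each parallel segment is a pair of token lists; the function name `dice` and the toy segments are invented for illustration.

```python
def dice(x: str, y: str, segments) -> float:
    """segments: list of (source tokens, target tokens) pairs, one per parallel segment."""
    f_xy = f_x = f_y = 0
    for source_tokens, target_tokens in segments:
        count_x = source_tokens.count(x)   # f(x, s_i)
        count_y = target_tokens.count(y)   # f(y, s_i)
        f_xy += min(count_x, count_y)      # co-occurrences in this segment
        f_x += count_x
        f_y += count_y
    if f_x + f_y == 0:
        return 0.0
    return 2 * f_xy / (f_x + f_y)


# Toy segments, invented for this example only.
toy_segments = [
    (["the", "union", "shall", "act"], ["a", "união", "age"]),
    (["the", "union", "and", "the", "states"], ["a", "união", "e", "os", "estados"]),
    (["member", "states"], ["estados", "membros"]),
]
print(dice("union", "união", toy_segments))  # 1.0
print(dice("the", "a", toy_segments))        # 0.8
```

In the toy data "union" and "união" always appear together, so their score is 1.0, while "the" occurs three times against two occurrences of "a", which lowers the score to 0.8.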
Experimental setup
Extracted all pairs of words with Dice ≥ 0.6.
Summary of extraction and manual verification
Language Pair        Accepted  Rejected  Total
German-English            269       878   1147
English-Spanish           399       749   1148
English-French            380       825   1205
English-Portuguese        410       796   1206
French-Italian            635       974   1609
Comparing SpSim to EDSim
[Figures: precision, recall and f-measure of EDSim and SpSim plotted against the threshold (both axes from 0 to 1), in two panels, English–Portuguese and English–German. The slide is repeated nine times, with SpSim learned from the example pairs whose EDSim exceeds a cutoff that decreases from 0.9 to 0.1 in steps of 0.1.]
Comparing SpSim to EDSim
[Figures: one panel per language pair, repeated for each EDSim cutoff. Number of examples SpSim was learned from:]
EDSim cutoff  EN-ES  EN-FR  EN-PT  DE-EN  FR-IT
> 0.9            18     31     16      4     14
> 0.8           103     93     57     25    124
> 0.7           168    149    140     46    251
> 0.6           203    181    202     61    362
> 0.5           244    220    246     75    449
> 0.4           255    234    267     89    502
> 0.3           286    260    299    106    538
> 0.2           329    301    346    147    581
> 0.1           368    343    380    219    622
Conclusions
- SpSim learns fast
- SpSim has much better recall than EDSim (and LCSR)
- SpSim has the same time complexity as EDSim and LCSR
- SpSim is almost as easy to implement as EDSim or LCSR
Thanks for listening
Questions?