CS626: Natural Language Processing/Speech, NLP and the Web
(Lecture 34–36: Phonetics and phonology; transliteration)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
16, 27, 28 Oct, 2014
16 oct, 2014Phonetics-phonology, Pushpak
Bhattacharyya 1
Motivation
Language origin: speech sound
Writing came much later
The primordial sound: a-u-m
ॐ
Motivation
Automatic Speech Recognition (ASR)
Text to Speech (TTS)
Speech to Speech Machine Translation
Speech interface to search
Ancient 5 x 5 Indian Classification of Consonants
Group     Consonants   Place
क वर्ग    क ख ग घ ङ    Velar
च वर्ग    च छ ज झ ञ    Palatal
ट वर्ग    ट ठ ड ढ ण    Alveolar
त वर्ग    त थ द ध न    Dental
प वर्ग    प फ ब भ म    Labial
Grapheme-phoneme: confused relation
Fish == Ghoti ?
Phonetic Symbols and IPA notation
IPA: vowels
Places of articulation
Place of Articulation
Labial: two lips coming together; [p] as in possum, [b] as in bear
Dental: tongue against the teeth; [th] of thing or the [dh] of though
Alveolar: the alveolar ridge is the portion of the roof of the mouth just behind the upper teeth; tip of the tongue against the alveolar ridge; phones [s], [z], [t], and [d]
Palatal: roof of the mouth; blade of the tongue against the rising back of the alveolar ridge; sounds [sh] (shrimp), [ch] (china), [zh] (Asian), and [jh] (jar)
Velar: the velum is the movable muscular flap at the back of the roof of the mouth; back of the tongue up against the velum; sounds [k] (cuckoo), [g] (goose), and [N] (kingfisher)
Glottal: the glottal stop [q] (IPA [ʔ]) is made by closing the glottis (by bringing the vocal folds together)
Manner of Articulation: Stops and Nasals
All consonants are produced by restriction of airflow
Manner of articulation: how the restriction is produced (complete or partial stoppage)
A stop is a consonant in which airflow is completely blocked for a short time
English has voiced stops like [b], [d], and [g] as well as unvoiced stops like [p], [t], and [k]
Stops are also called plosives
Nasal sounds [n], [m], and [ng] are made by lowering the velum and allowing air to pass into the nasal cavity
Vowels (1/2)
Vowels (2/2)
Fricatives
In fricatives, airflow is constricted but not cut off completely. The turbulent airflow that results from the constriction produces a characteristic “hissing” sound.
The English labiodental fricatives [f] and [v] are produced by pressing the lower lip against the upper teeth, allowing a restricted airflow between the upper teeth.
The dental fricatives [th] and [dh] allow air to flow around the tongue between the teeth.
The alveolar fricatives [s] and [z] are produced with the tongue against the alveolar ridge, forcing air over the edge of the teeth.
In the palato-alveolar fricatives [sh] and [zh] the tongue is at the back of the alveolar ridge, forcing air through a groove formed in the tongue.
Affricates, Laterals/Liquids and Taps/Flaps
Affricates are stops followed immediately by fricatives: English [ch] (chicken); Marathi chaa (e.g., gharaachaa; of the house)
Laterals or liquids: tip of the tongue up against the alveolar ridge or the teeth, with one or both sides of the tongue lowered to allow air to flow over it: [l] (learn)
Tap or flap: quick motion of the tongue against the alveolar ridge: [dx] (IPA [ɾ])
The consonant in the middle of the word lotus ([l ow dx ax s]) is a tap in most dialects of American English; speakers of many UK dialects would use a [t] instead of a tap in this word.
Articulation of consonants: Larynx action/glottis state (1/2)
Vocal cords are pulled apart. The air passes freely through the glottis. This is called the voicelessness state and sounds produced with this configuration of the vocal cords are called voiceless: p t k f θ s ʃ tʃ
Vocal cords are pulled close together. The air passing through the glottis causes the vocal cords to vibrate. This is called the voicing state and sounds produced with this configuration of the vocal cords are called voiced: b d g v ð z ʒ dʒ
Articulation of consonants: Larynx action/glottis state (2/2)
Vocal cords are apart at the back and pulled together at the front. This is called the whisper state.
Vocal cords assume the voicing state but are relaxed. This is called the murmur state.
Phonology: Syllables
Basics of syllables
“Syllable is a unit of spoken language consisting of a single uninterrupted sound formed generally by a Vowel and preceded or followed by one or more consonants.”
Vowels are the heart of a syllable (Most Sonorous Element) (svayam raajate iti svaraH)
Consonants act as sounds attached to vowels.
Syllable structure
A syllable consists of 3 major parts: Onset (C), Nucleus (V), Coda (C)
Vowels sit in the Nucleus of a syllable
Consonants may get attached as Onset or Coda
Basic structure: CV
Possible syllable structures
The Nucleus is always present
Onset and Coda may be absent
Possible structures: V, CV, VC, CVC
Syllable theories
Prominence Theory
E.g. entertaining /entəteɪnɪŋ/
The peaks of prominence: vowels /e ə eɪ ɪ/
Number of syllables: 4
Chest Pulse Theory: based on muscular activities
Sonority Theory: based on relative soundness of segments within words
Introduction to sonority theory
“The Sonority of a sound is its loudness relative to other sounds with the same length, stress and speech.”
Some sounds are more sonorous than others
Words in a language can be divided into syllables
Sonority theory distinguishes syllables on the basis of sounds.
Sonority hierarchy
Defined on the basis of the amount of sound associated
The sonority hierarchy is as follows (most sonorous first):
Vowels (a, e, i, o, u)
Liquids (y, r, l, v)
Nasals (n, m)
Fricatives (s, z, f, ... sh, th etc.)
Affricates (ch, j)
Stops (b, d, g, p, t, k)
Sonority scale
Obstruents can be further classified into: Fricatives, Affricates, Stops
Sonority theory & syllables
“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”
Represented as waves of sonority or Sonority Profile of that syllable
Nucleus
Onset Coda
Sonority sequencing principle
“The Sonority Profile of a syllable must rise until its Peak(Nucleus), and then fall.”
Peak(Nucleus)
Onset Coda
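The sonority sequencing principle can be checked mechanically once each phone has a sonority rank. The sketch below is only an illustration: the ranks follow the hierarchy listed earlier in these slides (including its grouping of y, r, l, v as liquids), while the function name `obeys_ssp`, the numeric ranks and the letter-like phone symbols are assumptions made for this example.

```python
# Sonority ranks, least sonorous first, following the slide's hierarchy.
# The numeric values are arbitrary; only their order matters.
SONORITY = {}
for rank, phones in enumerate([
    ["b", "d", "g", "p", "t", "k"],   # stops
    ["ch", "j"],                      # affricates
    ["s", "z", "f", "sh", "th"],      # fricatives
    ["n", "m"],                       # nasals
    ["y", "r", "l", "v"],             # liquids (as grouped on the slide)
    ["a", "e", "i", "o", "u"],        # vowels
], start=1):
    for p in phones:
        SONORITY[p] = rank


def obeys_ssp(phones):
    """Return True if sonority strictly rises to a single peak, then strictly falls."""
    profile = [SONORITY[p] for p in phones]
    peak = profile.index(max(profile))
    rising = all(a < b for a, b in zip(profile[:peak], profile[1:peak + 1]))
    falling = all(a > b for a, b in zip(profile[peak:], profile[peak + 1:]))
    return rising and falling


print(obeys_ssp(["p", "l", "a", "n"]))  # profile 1, 5, 6, 4
print(obeys_ssp(["l", "p", "a", "n"]))  # profile 5, 1, 6, 4
```

A strict check like this rejects onsets such as s + stop, which many languages nevertheless allow, so a practical syllabifier would need language-specific exceptions.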
Examples
ABHIJEET
Profile-1: A BHI JEET
Profile-2: ABHI JEET
Maximal onset principle
“The Intervocalic consonants are maximally assigned to the Onsets of syllables in conformity with Universal and Language-Specific Conditions.”
Determines underlying syllable division
Example: DIPLOMA: DIP LO MA & DI PLO MA
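A minimal sketch of the maximal onset principle on romanized words: each run of consonants between two vowel nuclei is split so that the longest legal onset goes to the following syllable. The function `syllabify`, the letter-based vowel set and the small `LEGAL_ONSETS` inventory are all assumptions for illustration; a real system would work on phonemes with a language-specific onset inventory.

```python
VOWELS = set("aeiou")
# Toy inventory of permitted onsets (hypothetical, for this example only).
LEGAL_ONSETS = {"d", "p", "l", "m", "pl", "b", "t", "r", "tr", "bh", "j"}


def syllabify(word, legal_onsets=LEGAL_ONSETS):
    """Split a word into syllables using the maximal onset principle."""
    word = word.lower()
    # Find vowel runs; each run is treated as one nucleus.
    nuclei = []
    i = 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i
            while j < len(word) and word[j] in VOWELS:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if not nuclei:
        return [word]
    # For each intervocalic cluster, give the longest legal suffix to the next onset.
    boundaries = []
    for (_, end), (start, _) in zip(nuclei, nuclei[1:]):
        cluster = word[end:start]
        for k in range(len(cluster) + 1):
            if cluster[k:] in legal_onsets or k == len(cluster):
                boundaries.append(end + k)
                break
    cuts = [0] + boundaries + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


print(syllabify("diploma"))
print(syllabify("abhijeet"))
```

With this toy inventory, "pl" is a legal onset, so the cluster in "diploma" goes entirely to the second syllable rather than being split after "p".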
Syllable Structure: a more detailed look
Count of no. of syllables in a word is roughly/intuitively the no. of vocalic segments in a word.
Thus, presence of a vowel is an obligatory element in the structure of a syllable. This vowel is called “nucleus”.
Basic Configuration: (C)V(C). Part of syllable preceding the nucleus is called the onset. Elements coming after the nucleus are called the coda. Nucleus and coda together are referred to as the rhyme.
S ≡ Syllable, O ≡ Onset, R ≡ Rhyme, N ≡ Nucleus, Co ≡ Coda
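The (C)V(C) configuration can be illustrated by splitting a single syllable, given as a list of phones, into onset, nucleus and coda. This is a sketch under stated assumptions: the phones are written in the ARPAbet symbols used by the CMU dictionary, the transcriptions of 'sprint', 'may' and 'opt' are supplied here for illustration, and `split_syllable` is a hypothetical helper name.

```python
# The 15 vowel symbols of the ARPAbet phoneme set (stress digits removed).
ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def split_syllable(phones):
    """Split one syllable into (onset, nucleus, coda); rhyme = nucleus + coda."""
    vowel_positions = [i for i, p in enumerate(phones) if p in ARPABET_VOWELS]
    if len(vowel_positions) != 1:
        raise ValueError("expected exactly one vowel (the nucleus)")
    n = vowel_positions[0]
    return phones[:n], phones[n], phones[n + 1:]


for word, phones in [("sprint", "S P R IH N T"), ("may", "M EY"), ("opt", "AA P T")]:
    onset, nucleus, coda = split_syllable(phones.split())
    print(word, "onset:", onset, "nucleus:", nucleus, "coda:", coda)
```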
Syllable Structure: Examples ‘word’
‘sprint’
Syllable Structure: Examples
‘may’: No Coda.
‘opt’: No Onset.
‘air’: No Coda, No Onset.
Syllable Structure
Open syllable: ends in a vowel
Closed syllable: ends in a consonant or consonant cluster
Light syllable: a syllable which is open and ends in a short vowel. General description: CV. Example: ‘air’.
Heavy syllable: closed syllables or syllables ending in a diphthong. Examples: ‘opt’, ‘may’.
Syllabification: Determining Syllable Boundaries
Given a string of syllables (word), what is the coda of one and the onset of another?
In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)?
To determine the correct groupings, there are some rules, two of them being the most important and significant: Maximal Onset Principle, Sonority Hierarchy
An important grapheme-phoneme dataset
CMU Pronouncing Dictionary
Machine-readable pronunciation dictionary for North American English that contains over 125,000 words and their transcriptions
The current phoneme set contains 39 phonemes
“Parallel” Corpus
Phoneme  Example  Translation
-------  -------  -----------
AA       odd      AA D
AE       at       AE T
AH       hut      HH AH T
AO       ought    AO T
AW       cow      K AW
AY       hide     HH AY D
B        be       B IY
“Parallel” Corpus (contd.)
Phoneme  Example  Translation
-------  -------  -----------
CH       cheese   CH IY Z
D        dee      D IY
DH       thee     DH IY
EH       Ed       EH D
ER       hurt     HH ER T
EY       ate      EY T
F        fee      F IY
G        green    G R IY N
HH       he       HH IY
IH       it       IH T
IY       eat      IY T
JH       gee      JH IY
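Entries of this kind are easy to read into grapheme-phoneme pairs. The sketch below assumes each entry is one line with the word followed by space-separated phones, and that vowels may carry a trailing stress digit (as in the distributed dictionary files); `parse_entry` and `count_syllables` are hypothetical helper names, not part of any CMU tooling. Counting vowel phones gives the rough syllable count mentioned earlier.

```python
import re

ARPABET_VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def parse_entry(line):
    """Return (word, phones) from a line such as 'GREEN  G R IY1 N'."""
    word, *phones = line.split()
    # Drop stress digits so that IY1 and IY0 both become IY.
    return word.lower(), [re.sub(r"\d", "", p) for p in phones]


def count_syllables(phones):
    """Approximate the syllable count as the number of vowel phones."""
    return sum(p in ARPABET_VOWELS for p in phones)


word, phones = parse_entry("GREEN  G R IY1 N")
print(word, phones, count_syllables(phones))
```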
Transliteration: grapheme-phoneme interaction
Work of Arjun (PhD student) and Swapnil (Masters student)
Sandhan
Mono & Cross Lingual Information Retrieval (CLIR) engine for Indian languages
Input: Query in one of the nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya)
Output: In Hindi, English and Query Language
Built on top of Nutch-Solr Framework
Online processing
[Architecture diagram; components: Snippet Translation, Summary Generation, Snippet Generation, Translation/Transliteration, MWE Lookup, NE Lookup, Analyzer, Query Formulation, Index, Information Extraction]
Query transliteration
Query translation is a major component in CLIR
Query translation = word translations + word transliteration
Transliteration accuracy directly affects CLIR performance [AbdulJaleel & Larkey, 2003]
Transliteration
Transliteration is the process of transforming a word from the script of one language to that of another
Word pronunciation is generally preserved, or is modified according to its pronunciation in the target language
Examples:
Hindi      Gujarati   English
अर्जुन      અર્જુન      Arjun
अंकुर       અંકુર       Ankur
राहुल       રાહુલ       Rahul
स्वप्निल     સ્વપ્નિલ     Swapnil
Linguistic challenges
Features affecting transliteration: Schwa deletion, Diphthongs, Differences in character set
Each of them adds to ambiguity
What is Schwa deletion
Phenomenon where the implicit schwas of a word are deleted for correct pronunciation
What is Schwa
A short ‘a’ vowel sound attached to every consonant by default
This sound can be changed into any other vowel sound by use of matras (dependent vowels)
Schwa is the default vowel for a consonant
No explicit matra is required to represent Schwa
Example: ‘गलत’ (galat, wrong): ‘a’ deletion after the ‘t’ sound
Effect of Schwa deletion
Schwa deletion is observed in Indo-Aryan languages but not in Dravidian languages
For example, in Hindi ‘अर्जुन’ is pronounced ‘Arjun’, while in Kannada it is pronounced ‘Arjuna’
Can be present in any part of the word
For example, ‘गलती’ (galti, mistake) in Hindi shows implicit schwa deletion after the consonant ‘ल’ (la)
Increases ambiguity in transliteration
Diphthongs
Two vowel sounds coming together in a single syllable unit
Schwa along with an independent vowel can together be replaced by a diphthong
For example, in Hindi the word ‘मुंबई’ (mumbai, Mumbai) can also be represented as ‘मुंबै’
Schwa (a) and independent vowel ‘ई’ (i) combine to form dependent vowel ‘◌ै’ (ai)
Mostly observed in Dravidian languages
Any vowel in the middle of the word gets converted to a dependent vowel or diphthong
Differences in character set
Indian languages contain different sets of consonants
Examples: Assamese, Bengali, Tamil have fewer consonants
In Bengali, the sounds ‘ba’ and ‘va’ are represented by the same character ‘ব’ (ba)
Tamil does not distinguish between voiced and unvoiced consonants
Sounds ‘k’, ‘kh’, ‘g’ and ‘gh’ are represented using a single character
Previous work
Om transliteration scheme (Madhavi et al., 2005) provides a script representation which is common for all Indian languages.
Surana et al. (2008) highlighted the importance of the origin of the word and proposed different ways of transliteration based on its origin.
Malik (2006) tried to solve a special case of Punjabi machine transliteration, converting Shahmukhi to Gurumukhi using rule based transliteration.
Gupta et al. (2010) used an intermediate notation called WX notation to transliterate a word from one language to another.
Rama et al. (2009) proposed the use of an SMT system for machine transliteration. Words are split into constituent characters and each character acts like a word in a sentence.
Approaches for transliteration
Rule Based
Can be as simple as one to one character mapping between languages
Set of rules for each language pair
Statistical
Word divided into segments
Segmentation can be at character or syllable level
Phrases learnt using SMT techniques
Rule based
Character to character (Offset) Mapping
Set of rules to tackle mapping ambiguity
For example, while transliterating to Bengali, all mappings from ‘va’ can be mapped to ‘ba’.
Hindi  Kannada
क      ಕ
ख      ಖ
ग      ಗ
घ      ಘ
Unicode transfer based transliteration
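Offset mapping works because the Unicode blocks for the major Indic scripts are laid out in parallel, so a letter can often be moved to another script by adding the difference between block start points. The sketch below is an illustration, not the Sandhan implementation: `offset_transliterate`, `BLOCK_START` and `OVERRIDES` are names assumed for this example, and only three scripts are listed. The single override encodes the slide's Bengali rule (Devanagari व, U+0935, has no counterpart at the same offset in the Bengali block, so it is sent to ব, U+09AC).

```python
# Start of each script's Unicode block; each block spans 0x80 code points.
BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "kannada": 0x0C80}
BLOCK_SIZE = 0x80

# Language-pair rules that take priority over the plain offset (assumed example).
OVERRIDES = {
    ("devanagari", "bengali"): {"\u0935": "\u09AC"},  # va -> ba
}


def offset_transliterate(text, src, tgt):
    """Map characters from the src script block to the tgt block by offset."""
    lo = BLOCK_START[src]
    delta = BLOCK_START[tgt] - lo
    rules = OVERRIDES.get((src, tgt), {})
    out = []
    for ch in text:
        if ch in rules:
            out.append(rules[ch])
        elif lo <= ord(ch) < lo + BLOCK_SIZE:
            out.append(chr(ord(ch) + delta))
        else:
            out.append(ch)  # leave characters outside the source block untouched
    return "".join(out)


print(offset_transliterate("कखगघ", "devanagari", "kannada"))
```

Not every code point has a counterpart in every target block, which is why the plain offset needs per-pair rules like the one above.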
Statistical
Character Based (CS)
Word split into individual characters
Hindi               Kannada             English
व ि◌ द ◌् य ◌ा ल य   ವ ◌ಿ ದ ◌್ ಯ ◌ಾ ಲ ಯ   v i d y a l a y
अ र ◌् ज ◌ु न        ಅ ರ ◌್ ಜ ◌ು ನ        a r j u n
Syllable based
Word split into syllables
Automatic syllabification difficult
Hindi         Kannada       English
विद् या लय     ವಿದ್ ಯಾ ಲಯ     vid ya lay
अर् जुन        ಅರ್ ಜುನ        ar jun
Vowel based segmentation
Split the word at vowel boundaries: उ द्या न ---- u dya n
Learn joint characters as a whole rather than as individual parts
Hindi           Kannada         English
वि द्या ल य      ವಿ ದ್ಯಾ ಲ ಯ      vi dya lay
अ र्जु न         ಅ ರ್ಜು ನ         a rju n
Algorithm
Input: word in Indian language
Output: Vowel segments
Scan characters of input word from left to right
For each character c, insert a space after c if
  c is a vowel
  c is a full consonant and next character is not a vowel
  next character is anusvar OR visarga OR chandra bindu OR Chandra
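The rule list on this slide is truncated in the transcript, so the following is only one possible reading of it for Devanagari, chosen because it reproduces the segmentations shown earlier (u dya n, a rju n). Assumptions made here: a "full consonant" is a consonant not followed by a matra or halant; anusvara, visarga and chandrabindu stay attached to the preceding segment; and the name `vowel_segments` and the code point ranges are supplied for this example.

```python
INDEPENDENT_VOWELS = range(0x0904, 0x0915)  # U+0904..U+0914
CONSONANTS = range(0x0915, 0x093A)          # U+0915..U+0939
MATRAS = range(0x093E, 0x094D)              # dependent vowel signs U+093E..U+094C
VIRAMA = 0x094D                             # halant
NUKTA = 0x093C
MODIFIERS = {0x0901, 0x0902, 0x0903}        # chandrabindu, anusvara, visarga


def vowel_segments(word):
    """Insert spaces so that each segment ends at a vowel boundary."""
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        if i + 1 == len(word):
            break  # never add a trailing space
        cp, nxt = ord(ch), ord(word[i + 1])
        if nxt in MODIFIERS or nxt == NUKTA:
            continue  # the following mark still belongs to this segment
        if cp in INDEPENDENT_VOWELS or cp in MATRAS or cp in MODIFIERS:
            out.append(" ")
        elif cp in CONSONANTS or cp == NUKTA:
            # A consonant with no matra or halant after it carries the implicit schwa.
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append(" ")
    return "".join(out)


print(vowel_segments("उद्यान"))
print(vowel_segments("अर्जुन"))
```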
Algorithm - English
Input: word in English
Output: Vowel segments for input word
Scan characters of input word from left to right
For each character c, insert a space after c if c is a vowel and next character is not a vowel
End
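The English-side rule translates almost directly into code. In this sketch the vowel set is assumed to be the five letters a, e, i, o, u, the function name `vowel_segments_en` is hypothetical, and no space is added after the last character of the word.

```python
VOWELS = set("aeiou")


def vowel_segments_en(word):
    """Insert a space after each vowel that is followed by a non-vowel."""
    out = []
    for i, c in enumerate(word):
        out.append(c)
        nxt = word[i + 1] if i + 1 < len(word) else None
        if c.lower() in VOWELS and nxt is not None and nxt.lower() not in VOWELS:
            out.append(" ")
    return "".join(out)


print(vowel_segments_en("udyan"))
print(vowel_segments_en("arjun"))
```

Because a run of vowels is kept together, a name such as "mumbai" keeps its final "ai" in one segment.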
Experiments
Transliteration within Indian languages (IL-IL)
System tested for 10 Indian languages, viz. Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi (pa), Tamil (ta) and Telugu (te)
Transliteration from Indian language to English (IL-En)
System tested for transliteration from 10 Indian languages to English
Dataset
Names collected from India Child Names (1) and Bachpan (2)
Parallel data collected from IndoWordNet (used as test data)
Data size for each language pair is variable: from 800 to 25000 words
1: www.indiachildnames.com
2: www.bachpan.com
Indian language to Indian language transliteration
Results
N-fold evaluation: 10-fold evaluation for 100 Indian language pairs
Testing on standout (held-out) test data: both systems tested on test data of 1000 words extracted from IndoWordNet
Accuracy at rank 1 is used for evaluation
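Accuracy at rank 1 counts a test word as correct only when the system's top-ranked candidate equals the reference. A minimal sketch, assuming each system output is a ranked list of candidate strings; the function name and the toy candidates are illustrative and are not results from the experiments.

```python
def accuracy_at_rank_1(ranked_candidates, references):
    """Fraction of test words whose top-ranked candidate matches the reference."""
    if len(ranked_candidates) != len(references):
        raise ValueError("need one candidate list per reference")
    hits = sum(
        1 for cands, ref in zip(ranked_candidates, references)
        if cands and cands[0] == ref
    )
    return hits / len(references)


# Toy data: the second word has the right answer only at rank 2, so it does not count.
outputs = [["arjun", "arjuna"], ["ankoor", "ankur"]]
gold = ["arjun", "ankur"]
print(accuracy_at_rank_1(outputs, gold))
```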
N-Fold Evaluation- Accuracies
Rows: source language; columns: target language. Each cell: CS/VS accuracy (%); n/a marks the diagonal.
    pa           as           bn           hi           gu           mr           te           kn           ml           ta
pa  n/a          53.28/82.71  77.78/93.07  57.47/98.29  58.12/91.00  57.80/81.12  94.00/97.88  60.96/97.73  59.73/98.60  58.13/98.42
as  40.09/82.92  n/a          54.93/84.57  56.14/85.81  61.05/84.93  48.70/82.34  64.79/84.25  61.28/85.05  57.75/82.97  57.38/83.48
bn  52.16/92.98  63.89/86.94  n/a          57.44/97.83  77.51/92.38  49.21/82.37  95.82/98.01  61.99/97.70  56.50/98.92  60.21/98.37
hi  49.28/89.38  57.87/85.98  69.36/88.96  n/a          54.16/91.56  42.57/84.20  91.89/95.63  60.74/95.26  51.79/97.12  55.15/96.33
gu  62.94/91.34  60.43/86.79  77.34/92.21  87.69/99.11  n/a          52.87/85.80  96.94/97.38  65.50/97.16  54.23/98.32  58.00/97.82
mr  45.56/81.60  58.15/87.84  57.92/84.45  43.55/78.45  60.21/83.30  n/a          62.85/77.33  73.60/77.16  55.13/76.56  47.88/80.34
te  83.68/98.27  50.52/80.15  85.84/97.63  88.21/99.12  95.46/97.38  43.72/77.60  n/a          64.69/99.26  84.23/97.29  80.15/98.11
kn  82.65/98.11  53.66/82.01  83.26/97.23  88.17/99.06  94.45/96.85  49.07/79.35  99.05/99.21  n/a          97.94/99.66  90.08/99.54
ml  84.73/99.30  48.62/79.19  96.91/99.25  88.16/99.12  97.18/98.62  43.71/79.11  98.30/99.30  86.23/99.65  n/a          87.51/98.43
ta  78.75/95.68  52.59/78.61  82.81/95.82  53.83/96.18  81.81/96.26  44.58/76.46  81.87/96.61  82.86/97.42  81.77/97.09  n/a
Observations – N-fold evaluation
Vowel Segmentation (VS) system outperforms Character segmentation (CS) system for all pairs
For many pairs, accuracy difference of around 30% is observed
Baseline – Rule based system
Rows: source language; columns: target language. Accuracy (%); n/a marks the diagonal, - marks pairs with no reported result.
    pa    as    bn    hi    gu    mr    te    kn    ml    ta
pa  n/a   56.1  79.5  94.5  84.7  66.2  94    92.3  93.7  67.7
as  62.2  n/a   80.1  65.3  59.8  59.8  55.3  58.2  55.7  -
bn  84.6  81    n/a   94.1  76.7  58    94.7  90.9  96.4  71.7
hi  84.7  64.3  73.1  n/a   85.3  73.7  93.4  92.9  96.1  73.6
gu  87.5  60.3  77.4  97.9  n/a   81.4  95.8  95.1  98.4  72.9
mr  65.2  61.1  56.4  73.2  82.9  n/a   67.5  66.6  67.2  -
te  96.9  56.1  91.9  98.5  96.8  65.8  n/a   98.9  99.1  73.1
kn  97.8  56.9  90.1  98.7  95.8  67.4  98.6  n/a   99.2  72.9
ml  99.3  54    95.3  98.9  98.2  68.4  99.4  99.2  n/a   72.5
ta  78.5  -     78.5  79.1  79.6  -     81.2  81.5  77.5  n/a
Character set difference impacts accuracies!!!
CS versus VS – Accuracies (Ignoring OOVs)
Rows: source language; columns: target language. Each cell: CS/VS accuracy (%); n/a marks the diagonal, - marks pairs with no reported result.
    pa           as           bn           hi           gu           mr           te           kn           ml           ta
pa  n/a          77.50/82.93  89.80/93.78  96.80/98.60  90.30/89.48  77.80/78.91  95.70/98.00  96.80/98.40  96.90/98.59  98.50/98.29
as  73.17/83.72  n/a          82.58/86.89  76.38/87.21  74.45/85.32  71.07/81.35  71.69/83.75  73.95/86.58  69.35/82.29  -
bn  90.30/93.09  78.60/87.70  n/a          97.40/97.79  90.40/93.80  68.20/80.66  96.20/96.99  95.50/97.09  98.40/98.20  97.70/97.98
hi  86.40/87.60  79.30/84.94  79.70/88.29  n/a          81.20/87.99  72.77/82.88  95.70/96.49  93.30/93.59  95.40/96.79  95.60/95.79
gu  89.30/88.80  83.00/87.01  84.10/91.19  98.70/98.99  n/a          81.60/82.98  97.00/97.00  95.70/96.70  98.00/98.39  98.00/98.19
mr  78.70/79.88  79.48/89.37  75.40/84.64  66.87/75.88  77.40/81.25  n/a          67.00/76.46  75.05/79.28  69.62/76.84  -
te  97.40/98.40  75.35/80.76  96.40/98.09  99.20/99.30  97.60/98.20  70.10/77.23  n/a          98.70/98.80  99.00/97.79  98.50/98.79
kn  97.60/98.40  76.48/82.19  94.60/97.40  98.50/98.90  96.20/96.89  71.64/81.03  99.20/99.60  n/a          99.50/99.90  98.90/99.30
ml  99.00/99.10  73.45/80.14  99.60/99.70  99.10/99.30  98.40/99.00  72.09/80.05  98.90/99.40  99.80/99.90  n/a          97.20/97.89
ta  84.10/94.25  -            86.20/95.36  86.80/95.46  86.70/96.68  -            86.50/96.59  86.90/96.18  85.70/96.17  n/a
Consistent improvement in accuracies!!!
Side Effect (OOVs)
Rows: source language; columns: target language. Each cell: CS/VS number of OOVs; n/a marks the diagonal, - marks pairs with no reported result.
    pa      as      bn      hi      gu      mr      te      kn      ml      ta
pa  n/a     0/16    0/3     0/2     0/2     0/9     0/2     0/3     0/4     0/4
as  1/17    n/a     0/0     1/62    2/19    1/51    4/114   2/76    5/153   -
bn  0/1     0/0     n/a     0/5     0/0     0/7     0/3     0/4     0/2     0/8
hi  0/0     0/4     0/1     n/a     0/1     0/0     0/4     0/2     0/4     0/3
gu  0/0     0/15    0/1     0/5     n/a     0/1     0/1     0/1     0/4     0/3
mr  0/1     1/50    0/4     0/0     0/8     n/a     0/74    2/59    6/171   -
te  0/2     2/101   0/4     0/2     0/2     0/60    n/a     0/2     0/3     0/9
kn  0/3     1/79    0/0     0/4     0/4     2/67    0/3     n/a     0/4     0/2
ml  0/1     17/139  0/6     0/5     0/2     4/153   0/4     0/1     n/a     0/7
ta  0/8     -       0/9     0/9     0/6     -       0/3     0/4     0/8     n/a
Lesser coverage!!!
OOVs Impact
Rows: source language; columns: target language. Each cell: CS/VS accuracy (%); n/a marks the diagonal, - marks pairs with no reported result.
    pa           as           bn           hi           gu           mr           te           kn           ml           ta
pa  n/a          77.50/81.60  89.80/93.50  96.80/98.40  90.30/89.30  77.80/78.20  95.70/97.80  96.80/98.10  96.90/98.20  98.50/97.90
as  73.10/82.30  n/a          82.58/86.89  76.30/81.80  74.30/83.70  71.00/77.20  71.40/74.20  73.80/80.00  69.00/69.70  -
bn  90.30/93.00  78.60/87.70  n/a          97.40/97.30  90.40/93.80  68.20/80.10  96.20/96.70  95.50/96.70  98.40/98.00  97.70/97.20
hi  86.40/87.60  79.30/84.60  79.70/88.20  n/a          81.20/87.90  72.77/82.88  95.70/96.10  93.30/93.40  95.40/96.40  95.60/95.50
gu  89.30/88.80  83.00/85.70  84.10/91.10  98.70/98.50  n/a          81.60/82.90  97.00/96.90  95.70/96.60  98.00/98.00  98.00/97.90
mr  78.70/79.80  79.40/84.90  75.40/84.30  66.87/75.88  77.40/80.60  n/a          67.00/70.80  74.90/74.60  69.20/63.70  -
te  97.40/98.20  75.20/72.60  96.40/97.70  99.20/99.10  97.60/98.00  70.10/72.60  n/a          98.70/98.60  99.00/97.50  98.50/97.90
kn  97.60/98.10  76.40/75.70  94.60/97.40  98.50/98.50  96.20/96.50  71.50/75.60  99.20/99.30  n/a          99.50/99.50  98.90/99.10
ml  99.00/99.00  72.20/69.00  99.60/99.10  99.10/98.80  98.40/98.80  71.80/67.80  98.90/99.00  99.80/99.80  n/a          97.20/97.20
ta  84.10/93.50  -            86.20/94.50  86.80/94.60  86.70/96.10  -            86.50/96.30  86.90/95.80  85.70/95.40  n/a
Correctness versus coverage
Vowel segmentation increases correctness with reduced coverage
Character segmentation has high coverage with low accuracy
Ideal transliteration system should have high coverage and high accuracy
Hybrid System
Use the Character segmentation system for OOVs
Only segments not transliterated by the VS system are transliterated using the CS system
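The fallback logic of the hybrid system can be sketched with two lookup tables. This is a deliberate simplification: the systems described in these slides learn their mappings with SMT techniques, whereas the tables `vs_table` and `cs_table` below are tiny hand-made dictionaries, and `hybrid_transliterate` is a hypothetical function name.

```python
def hybrid_transliterate(segments, vs_table, cs_table):
    """Transliterate vowel segments; fall back to per-character mapping for OOV segments."""
    out = []
    for seg in segments:
        if seg in vs_table:
            out.append(vs_table[seg])  # segment known to the VS system
        else:
            # OOV for the VS system: transliterate it character by character instead.
            out.append("".join(cs_table.get(ch, ch) for ch in seg))
    return "".join(out)


# Toy Hindi-to-English tables (illustrative only).
vs_table = {"अ": "a", "र्जु": "rju"}
cs_table = {"न": "n"}
print(hybrid_transliterate(["अ", "र्जु", "न"], vs_table, cs_table))
```

The design keeps the higher accuracy of whole-segment mappings wherever they exist and gives up on a segment only when neither table covers it.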
Hybrid System
Rows: source language; columns: target language. Each cell: CS/VS accuracy (%); n/a marks the diagonal, - marks pairs with no reported result.
    pa           as           bn           hi           gu           mr           te           kn           ml           ta
pa  n/a          77.50/82.50  89.80/93.70  96.80/98.60  90.30/89.50  77.80/78.90  95.70/97.90  96.80/98.40  96.90/98.50  98.50/98.30
as  73.10/83.10  n/a          82.58/86.89  76.30/85.90  74.30/84.80  71.00/80.60  71.40/81.70  73.80/85.20  69.00/78.40  -
bn  90.30/93.10  78.60/87.70  n/a          97.40/97.80  90.40/93.80  68.20/80.60  96.20/96.90  95.50/97.00  98.40/98.20  97.70/98.00
hi  86.40/87.60  79.30/84.80  79.70/88.30  n/a          81.20/88.00  72.77/82.88  95.70/96.50  93.30/93.60  95.40/96.70  95.60/95.80
gu  89.30/88.80  83.00/87.00  84.10/91.20  98.70/99.00  n/a          81.60/83.00  97.00/97.00  95.70/96.70  98.00/98.40  98.00/98.20
mr  78.70/79.90  79.40/88.60  75.40/84.40  66.87/75.88  77.40/81.40  n/a          67.00/74.60  74.90/78.60  69.20/73.90  -
te  97.40/98.40  75.20/79.80  96.40/98.10  99.20/99.30  97.60/98.20  70.10/76.90  n/a          98.70/98.80  99.00/97.70  98.50/98.80
kn  97.60/98.40  76.40/81.30  94.60/97.40  98.50/98.90  96.20/96.80  71.50/79.60  99.20/99.60  n/a          99.50/99.90  98.90/99.30
ml  99.00/99.10  72.20/77.70  99.60/99.60  99.10/99.30  98.40/99.00  71.80/77.70  98.90/99.40  99.80/99.90  n/a          97.20/97.90
ta  84.10/94.30  -            86.20/95.30  86.80/95.50  86.70/96.60  -            86.50/96.60  86.90/96.20  85.70/95.90  n/a
Summary
Ta - Hi          Accuracy               F-Score
Rule Based       81.7                   0.85
CS vs. VS        CS: 86.80, VS: 95.46   CS: 0.975, VS: 0.989
OOVs impact      CS: 86.80, VS: 94.60   CS: 0.975, VS: 0.987
Hybrid System    CS: 86.80, VS: 95.50   CS: 0.975, VS: 0.989
Error analysis
Modified Levenshtein distance algorithm used to compute the confusion matrix
Errors in the CS based system persist in the VS based system, but to a lesser extent
Common errors include confusion between:
Matras
Voiced / unvoiced consonants
Aspirated / unaspirated consonants
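The slides do not say how the Levenshtein algorithm was modified, so the sketch below uses the standard edit-distance table and simply backtraces through it to count which characters were substituted, deleted (output NULL) or inserted (expected NULL). The function name `confusion_counts` and the tie-breaking order in the backtrace are choices made for this example.

```python
from collections import Counter


def confusion_counts(expected, output):
    """Count (expected_char, output_char) edit pairs from a Levenshtein alignment."""
    n, m = len(expected), len(output)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if expected[i - 1] == output[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, preferring match/substitution.
    counts = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and expected[i - 1] != output[j - 1]
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (1 if mismatch else 0):
            if mismatch:
                counts[(expected[i - 1], output[j - 1])] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts[(expected[i - 1], "NULL")] += 1  # character missing from the output
            i -= 1
        else:
            counts[("NULL", output[j - 1])] += 1    # extra character in the output
            j -= 1
    return counts


print(confusion_counts("arjun", "arjn"))
```

Summing these counters over all test pairs and sorting by count would give a ranked error list of the kind shown on the error slides.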
Top 10 Errors Tamil - Hindi
CS System
    Expected                        Output              Count
1   थ (tha)                         त (ta)              52
2   त (ta)                          थ (tha)             22
3   ख (kha)                         क (ka)              19
4   फ (fa)                          प (pa)              14
5   झ (jha)                         ज (ja)              13
6   ◌् (half consonant modifier)     NULL                4
7   क (ka)                          ख (kha)             2
8   छ (chha)                        च (cha)             2
9   न (na)                          ◌ं ('ma' or 'na')    2
10  न (na)                          ण (na)              2
VS System
    Expected                        Output              Count
1   थ (tha)                         त (ta)              16
2   झ (jha)                         ज (ja)              7
3   ख (kha)                         क (ka)              5
4   ◌् (half consonant modifier)     NULL                4
5   त (ta)                          थ (tha)             3
6   फ (fa)                          प (pa)              3
7   न (na)                          ◌ं ('ma' or 'na')    2
8   न (na)                          ण (na)              2
9   म (ma)                          ◌ं ('ma' or 'na')    2
10  ◌ी (ee)                         ि◌ (i)              2
How much is sufficient for Training?
Increase in training data reduces the number of OOVs
A threshold on the number of OOVs helps determine the lower limit for training data size
We used fewer than 50 OOVs as the threshold
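The threshold rule amounts to picking the smallest training size whose OOV count falls below the limit. A small sketch: the function name `minimum_training_size` is assumed, and the size-to-OOV numbers in the usage example are invented purely to show the calculation, not taken from the experiments.

```python
def minimum_training_size(oov_by_size, max_oovs=50):
    """Return the smallest training size with fewer than max_oovs OOVs, or None."""
    for size in sorted(oov_by_size):
        if oov_by_size[size] < max_oovs:
            return size
    return None


# Hypothetical measurements: training words -> OOV segments on the test set.
measurements = {1000: 180, 2000: 90, 4000: 40, 8000: 12}
print(minimum_training_size(measurements))
```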
Tamil – Hindi stepwise accuracies
Step Wise results – Tamil as source
Tamil has higher coverage with less training data
Malayalam needs more training data for coverage of all segments
Indian language to English transliteration
Challenges – Origin plays an important role
Many Indian language words are borrowed from Sanskrit; these are called tatsama words.
For example, दुर्ग (durg, fort) is retained in Hindi, Marathi and Kannada
दुर्ग gets transliterated to durg in the case of ‘Sindhudurg’ and durga in the case of ‘Chitradurga’
Sindhudurg has the influence of Marathi while Chitradurga has the influence of Kannada
Origin of the place influences how it is written in English
Challenges- One to many mapping
A single character in an Indian language may be represented by multiple characters in English
For example, ‘ख’ in Hindi is represented by ‘kh’ in English
The same character in an Indian language may be represented by multiple English segments
For example, ‘◌ी’ in Hindi can be represented as ‘i’ as well as ‘ee’
Results
N-fold evaluation performed on 10 Indian languages to English
    as     bn     gu     hi     kn     ml     mr     pa     ta     te
CS  04.62  05.66  12.45  11.71  12.31  12.72  08.24  06.39  07.74  10.25
VS  50.68  37.75  48.87  57.21  48.94  48.49  45.77  39.28  35.39  46.99
Error analysis
Schwa deletion and insertion is the major cause of errors in transliteration
CS system: character deletion, mainly at the end of the word; schwa deletion
VS system: schwa deletion and insertion; confusion in matras
Top Ten Errors – Hindi to English
CS System
Expected  Output  Count
a         NULL    33754
h         NULL    15239
n         NULL    12165
y         NULL    5498
r         NULL    4009
t         NULL    3188
e         NULL    2377
k         NULL    1608
NULL      a       1556
l         NULL    1505
VS System
Expected  Output  Count
a         NULL    9177
NULL      a       3651
NULL      i       2784
n         NULL    2657
NULL      h       1095
e         i       1081
e         NULL    869
NULL      u       730
NULL      e       717
i         e       502
Impact on Sandhan: Marathi CLIR
Query translation evaluation
                             Not stemmed   Stemmed
Char based Transliteration   0.44375       0.475
Vowel segmentation           0.5375        0.55625
Google translate             0.6375        0.65
Total Marathi-English translations evaluated = 80
Quality  Score
Good     1
Medium   0.5
Bad      0
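Values such as 0.44375 are consistent with averaging the per-translation quality scores over the evaluated queries (0.44375 × 80 = 35.5). The sketch below shows that arithmetic under that assumption; `mean_quality` is a hypothetical helper and the four judgements in the usage example are made up.

```python
SCORES = {"Good": 1.0, "Medium": 0.5, "Bad": 0.0}


def mean_quality(judgements):
    """Average the Good/Medium/Bad scores over all judged translations."""
    if not judgements:
        raise ValueError("no judgements to average")
    return sum(SCORES[j] for j in judgements) / len(judgements)


# Four illustrative judgements: (1 + 0.5 + 0 + 1) / 4
print(mean_quality(["Good", "Medium", "Bad", "Good"]))
```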
Examples
Marathi query     Char based system     VS system             Google translate
सागरेश्वर           sagareshawar          sagareshwar           sagares on
ठोसेघर धबधबा       thoseghr waterfall    thoseghar waterfall   thosegharachute
उरमोडी नदी         urmodi river          uramody river         uramodi river
संत ज्ञानेश्वर       saint jayaneshawar    saint jjnaneshwar     sant gyaneshwar
Error analysis of translations
Incorrect entries in translation dictionary
Example: district misspelt as distt
Most frequently used translation word not picked
Example: किल्ला (killa, fort) gets translated to stronghold instead of fort
Stemming error
Example: लेणी (leni, cave) gets stemmed to लेणे (lene)