cs626 : natural language processing/speech, nlp and …pb/cs626-2014/cs626-lect34-36... · cs626 :...

CS626 : Natural Language Processing/Speech, NLP and the Web(Lecture 34–36: Phonetics and phonology;

transliteration)

Pushpak BhattacharyyaCSE Dept., IIT Bombay

16, 27, 28 Oct, 2014

16 oct, 2014Phonetics-phonology, Pushpak

Bhattacharyya 1

Motivation

Language origin- speech sound Writing came much later

The primordial sound: a-u-m

ॐ16 oct, 2014

Phonetics-phonology, Pushpak Bhattacharyya 2

Motivation

Language origin- speech sound Writing came much later

Automatic Speech Recognition (ASR) Text to Speech (TTS) Speech to Speech Machine Translation Speech interface to search


Bhattacharyya 3

Ancient 5 x 5 Indian Classification of ConsonantsGroupक वग क ख ग घ ङ Velarच वग च छ ज झ ञ Palatalट वग ट ठ ड ढ ण Alveolarत वग त थ द ध न Dentalप वग प फ ब भ म Labial


Bhattacharyya 4

Grapheme-phoneme: confused relation

Fish == Ghoti ?


Bhattacharyya 5

Phonteic Symbols and IPA notation


Bhattacharyya 6

IPA: vowels


Bhattacharyya 7

Places of articulation


Bhattacharyya 8

Place of Articulation Labial: Two lips coming together

[p] as in possum, [b] as in bear Dental: Tongue against the teeth

[th] of thing or the [dh] of though Alveolar: Alveolar ridge is the portion of the roof of the mouth just behind the upper teeth; tip

of the tongue against the alveolar ridge. Phones [s], [z], [t], and [d]

Palatal: Roof of the mouth; blade of the tongue against this rising back of the alveolar ridge sounds [sh] (shrimp), [ch] (china), [zh] (Asian), and [jh] (jar)

Velar: Movable muscular flap at the back of the roof of the mouth; back of the tongue up against the velum sounds [k] (cuckoo), [g] (goose), and [N] (kingfisher)

Glottal: closing the glottis (by bringing the vocal folds together) glottal stop [q] (IPA [P]) is made by closing the glotis


Bhattacharyya 9

Manner of Articulation: Stops and Nasals

All consonants are produced by restriction of airflow Manner of Articulation; how the restriction is produced:

complete or partial stoppage A stop is a consonant in which airflow is completely blocked for a short time English has voiced stops like [b], [d], and [g] as well as unvoiced stops like [p], [t], and

[k]. Stops are also called plosives Nasal sounds [n], [m], and [ng] are made by lowering the velum and allowing air

to pass into the nasal cavity


Bhattacharyya 10

Vowels (1/2)


Bhattacharyya 11

Vowels (2/2)


Bhattacharyya 12

Fricatives Fricatives, airflow is constricted but not cut off completely. The turbulent airflow that

results from the constriction produces a characteristic “hissing” sound. The English labiodental fricatives [f] and [v] are produced by pressing the lower lip

against the upper teeth, allowing a restricted airflow between the upper teeth. The dental fricatives [th] and [dh] allow air to flow around the tongue between the teeth. The alveolar fricatives [s] and [z] are produced with the tongue against the alveolar

ridge, forcing air over the edge of the teeth. In the palato-alveolar fricatives [sh] and [zh] the tongue is at the back of the alveolar

ridge forcing air through a groove formed in the tongue.


Bhattacharyya 13

Affricates, Laterals/Liquids and Taps/Flaps

Affricates are stops followed immediately by fricatives English [ch] (chicken); Marathi chaa (e.g., gharaachaa; of the house)

Lateral or Liquids: tip of the tongue up against the alveolar ridge or the teeth, with one or both sides of the tongue lowered to allow air to flow over it [l] (learn)

Tap or flap: quick motion of the tongue against the alveolar ridge [dx] (IPA [R]) The consonant in the middle of the word lotus ([l ow dx ax s]) is a tap in most dialects

of American English speakers of many UK dialects would use a [t] instead of a tap in this word.


Bhattacharyya 14

Articulation of consonants: Larynx action/glottis state (1/2)

Vocal cords are pulled apart. The air passes freely through the glottis. This is called the voicelessness state and sounds produced with this configuration of the vocal cords are called voiceless: p t k f θ s ʃ tʃ

Vocal cords are pulled close together. The air passing through the glottis causes the vocal cords to vibrate. This is called the voicing state and sounds produced with this configuration of the vocal cords are called voiced: b d g v ð z ʒ dʒ


Bhattacharyya 15

Articulation of consonants: Larynx action/glottis state (2/2)

Vocal cords are apart at the back and pulled together at the front. This is called the whisper state.

Vocal cords assume the voicing state but are relaxed. This is called the murmur state.


Bhattacharyya 16

Phonology: Syllables


Bhattacharyya 17

Basic of syllables“Syllable is a unit of spoken languageconsisting of a single uninterrupted soundformed generally by a Vowel and preceded orfollowed by one or more consonants.”

Vowels are the heart of a syllable (MostSonorous Element) (svayam raajate itisvaraH)

Consonants act as sounds attached tovowels.16 oct, 2014


Syllable structure

A syllable consists of 3 major parts:- Onset (C) Nucleus (V) Coda (C)

Vowels sit in the Nucleus of a syllable Consonants may get attached as Onset

or Coda. Basic structure - CV


Bhattacharyya 19

Possible syllable structures The Nucleus is

always present Onset and Coda

may be absent Possible

structures V CV VC CVC16 oct, 2014


syllable theories Prominence Theory

E.g. entertaining /entəteɪnɪŋ/ The peaks of prominence: vowels /e ə eɪ

ɪ/ Number of syllables: 4

Chest Pulse Theory Based on muscular activities

Sonority Theory Based on relative soundness of segment

within words16 oct, 2014


Introduction to sonority theory“The Sonority of a sound is its loudness

relative to other sounds with the same length, stress and speech.”

Some sounds are more sonorous Words in a language can be divided into

syllables Sonority theory distinguishes syllables on

the basis of sounds.


Bhattacharyya 22

Sonority hierarchy Defined on the basis of amount of

sound associated The sonority hierarchy is as follows:-

Vowels (a, e, i, o, u) Liquids (y, r, l, v) Nasals (n, m) Fricatives (s, z, f,…..sh, th etc.) Affricates (ch, j) Stops (b, d, g, p, t, k)


Bhattacharyya 23

Sonority scale Obstruents can

be further classified into:- Fricatives Affricates Stops


Bhattacharyya 24

Sonority theory & syllables“A Syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.”

Represented as waves of sonority or Sonority Profile of that syllable

Nucleus

Onset Coda


Bhattacharyya 25

Sonority sequencing principle

“The Sonority Profile of a syllable must rise until its Peak(Nucleus), and then fall.”

Peak(Nucleus)

Onset Coda


Bhattacharyya 26

examples

ABHIJEET

A

BHI

JEET

ABHI

JEET

Profile-1

Profile-2


Bhattacharyya 27

Maximal onset principle“The Intervocalic consonants are maximally

assigned to the Onsets of syllables inconformity with Universal and Language-Specific Conditions.”

Determines underlying syllable division Example

DIPLOMADIP LO MA & DI PLOMA


Bhattacharyya 28

Syllable Structure: a more detailed look

Count of no. of syllables in a word is roughly/intuitively the no. of vocalic segments in a word.

Thus, presence of a vowel is an obligatory element in the structure of a syllable. This vowel is called “nucleus”.

Basic Configuration: (C)V(C). Part of syllable preceding the nucleus is called the onset. Elements coming after the nucleus are called the coda. Nucleus and coda together are referred to as the rhyme.

S ≡ Syllable, O ≡ OnsetR ≡ Rhyme, N ≡ NucleusCo ≡ Coda


Bhattacharyya 29

Syllable Structure: Examples ‘word’

‘sprint’


Bhattacharyya 30

Syllable Structure: Examples ‘may’

‘opt’

‘air’

No Coda.

No Onset.

No Coda, No Onset.


Bhattacharyya 31

Syllable Structure Open Syllable: ends in vowel Closed syllable: ends in consonant or consonant cluster

Light Syllable: A syllable which is open and ends in a short vowel General Description – CV. Example, ‘air’.

Heavy Syllable: Closed syllables or syllables ending in diphthong Example: ‘opt’ Example, ‘may’


Bhattacharyya 32

Syllabification: Determining Syllable Boundaries

Given a string of syllables (word), what is the coda of one and the onset of another?

In a sequence such as VCV, where V is any vowel and C is any consonant, is the medial C the coda of the first syllable (VC.V) or the onset of the second syllable (V.CV)?

To determine the correct groupings, there are some rules, two of them being the most important and significant: Maximal Onset Principle, Sonority Hierarchy


Bhattacharyya 33

An important grapheme-phoneme data


Bhattacharyya 34

CMU Pronouncing Dictionary machine-readable pronunciation

dictionary for North American English that contains over 125,000 words and their transcriptions.

The current phoneme set contains 39 phonemes


Bhattacharyya 35

“Parallel” CorpusPhoneme Example Translation ------- ------- -----------AA odd AA D AE at AE T AH hut HH AH T AO ought AO T AW cow K AW AY hide HH AY D B be B IY


Bhattacharyya 36

“Parallel” Corpus cntdPhoneme Example Translation ------- ------- -----------CH cheese CH IY Z D dee D IY DH thee DH IY EH Ed EH D ER hurt HH ER T EY ate EY T F fee F IY G green G R IY N HH he HH IY IH it IH T IY eat IY T JH gee JH IY


Bhattacharyya 37

Transliteration: grapheme-phoneme interaction

38

Work of Arjun, PhD student, Swapnil, Masters student

Sandhan

39

Mono & Cross Lingual Information Retrieval (CLIR) engine for Indian languages Input: Query in one of the nine Indian

languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese. Gujarati, Oriya)

Output: In Hindi, English and Query Language

Built on top of Nutch-Solr Framework

Online processing

40

Snippet Translation

Summary Generation

Snippet Generation

Translation /Transliteration

MWE Lookup

NE Lookup

Analyzer

Query Formulation

Index

Information Extraction

Query translation is major component in CLIR Query translation = word translations + word transliteration Transliteration accuracy directly affects CLIR performance [AbdulJaleel & Larkey, 2003]

Query transliteration

41

Transliteration

42

Transliteration is a process of transforming a word from script of one language to other

Word pronunciation is generally preserved, or is modified according its pronunciation in target language

Examples:

Hindi Gujarati Englishअजुन અ ુ ન Arjunअंकुर ુ ર Ankurराहु ल રા ુલ Rahulवि नल વ નલ Swapnil

Linguistic challenges

43

Features affecting transliteration Schwa deletion Diphthongs Differences in character set

Each of them adds to ambiguity

What is Schwa deletion

44

Phenomenon where implicit Schwa of a word are deleted for correct pronunciation

What is Schwa A short ‘a’ vowel sound attached with every

consonant by default. This sound can be changed into any other vowel

sound by use of matras (dependent vowels) Schwa is the default vowel for a consonant No explicit matra required to represent Schwa

Examples: ‘गलत’ (galat, wrong) – ‘a’ deletion after ‘t’ sound

Effect of Schwa deletion

45

Schwa deletion observed in Indo-Aryan languages but not in Dravidian languages.

For example, In Hindi, ‘अजुन’ is pronounced as ‘Arjun’ while in

Kannada it is pronounced as ‘Arjuna’ Can be present in any part of the word For example,

‘गलती’ (galti, mistake) in Hindi observes implicit schwa deletion after the consonant ‘ल’ (la)

Increases ambiguity in transliteration

Diphthongs

46

Two vowel sounds coming together in a single syllable unit

Schwa along with independent vowel can together be replaced by a diphthong

For example, In Hindi, the word ‘मु ंबई’ (mumbai, Mumbai) can also

be represented as ‘मु ब’ै Schwa (a) and independent vowel ‘ई’ (i) combine to

form dependent vowel ‘◌ै’ (ai) Mostly observed in Dravidian languages

Any vowel in middle of the word gets converted to dependent vowel or diphthong

Differences in character set

47

Indian languages contain different sets of consonants Examples: Assamese, Bengali, Tamil have

lesser consonants In Bengali, the sounds ‘ba’ and ‘va’ are

represented by the same character ‘ব’ (ba) Tamil does not distinguish between voiced

and unvoiced consonants. Sounds ‘k’, ‘kh’, ‘g’ and ‘gh’ represented using

single character

Previous work

48

Om transliteration scheme (Madhavi et. al , 2005) provides a script representation which is common for all Indian languages.

Surana et. al (2008) highlighted the importance of origin of the word and proposed different ways of transliteration based on its origin

Malik (2006) tried to solve a special case of Punjabi machine transliteration. They converted Shahmukhi to Gurumukhi using rule based transliteration

Gupta et. al (2010) used an intermediate notation called as WX notation to transliterate the word from one language to another

Rama et. al (2009) proposed the use of SMT system for machine transliteration. Words are split into constituent characters and each character acts like a word in a sentence.

Approaches for transliteration

49

Rule Based Can be as simple as one to one character

mapping between languages Set of rules for each language pair

Statistical Word divided in to segments. Segmentation can be at character or

syllable level. Phrases learnt using SMT techniques.

Rule based

50

Character to character (Offset) Mapping

Set of rule to tackle mapping ambiguity For example while transliterating to Bengali,

all mappings from ‘va’ can be mapped to ‘ba’.

Hindi Kannadaक ಕ

ख ಖ

ग ಗ

घ ಘ

UnicodeTransfer basedtransliteration


Bhattacharyya 51

Statistical

52

Character Based (CS) Word split into individual characters

Syllable based Word split into syllables Automatic syllabification difficult

Hindi Kannada Englishव ि◌ द ◌् य ◌ा ल य ವ ◌ಿ ದ ◌್ ಯ ◌ಾ ಲ ಯ v i d y a l a y

अ र ◌् ज ◌ु न ಅ ರ ◌್ ಜ ◌ು ನ a r j u n

Hindi Kannada Englishव या लय ಾ ಲ ಯ vid ya layअर् जुन ಅ ಜು ನ ar jun

Vowel based segmentation

53

Split the word at vowel boundaries उ या न ---- u dya n

Learn joint characters as a whole than individual parts

Hindi Kannada Englishव या ल य ಾ ಲ ಯ vi dya layअ जु न ಅ ಜು ನ a rju n

Algorithm

54

Input: word in Indian language Output: Vowel segments Scan characters of input word from left

to right For each character c insert a space after

c if c is a vowel c is a full consonant and next character is not a

vowel next character is anusvar OR visarga OR chandra

bindu OR Chandra

Algorithm - English

55

Input: word in English Output: Vowel segments for input word Scan characters of input word from left

to right For each character c, insert a space

after c if c is a vowel and next character is not a vowel.

End

Experiments

56

Transliteration within Indian languages (IL-IL) System tested for 10 Indian languages viz.

Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (ka), Malayalam (ml), Marathi (mr), Tamil (ta) and Telugu (te)

Transliteration from Indian language to English (IL-En) System tested for transliteration from 10

Indian languages to English

Dataset

57

Names collected from India Child Names1 and Bachpan2

Parallel data collected from IndoWordNet (used as test data)

Data for each language pair is variable. Data size varies from 800 to 25000

words

1: www.indiachildnames.com 2: www.bachpan.com

Indian language to Indian language transliteration

58

Results

59

N-fold evaluation 10 – fold evaluation for 100 Indian

language pairs

Testing on standout test data Both systems test on test data of 1000

words extracted from IndoWordNet

Accuracy at rank 1 is used for evaluation

N-Fold Evaluation- Accuracies

60

pa as bn hi gu mr te kn ml ta

pa CS:53.28VS:82.71

CS:77.78VS:93.07

CS:57.47VS:98.29

CS:58.12VS:91.00

CS:57.80VS:81.12

CS:94.00VS:97.88

CS:60.96VS:97.73

CS:59.73VS:98.60

CS:58.13VS:98.42

as CS:40.09VS:82.92

CS:54.93VS:84.57

CS:56.14VS:85.81

CS:61.05VS:84.93

CS:48.70VS:82.34

CS:64.79VS:84.25

CS:61.28VS:85.05

CS:57.75VS:82.97

CS:57.38VS:83.48

bn CS:52.16VS:92.98

CS:63.89VS:86.94

CS:57.44VS:97.83

CS:77.51VS:92.38

CS:49.21VS:82.37

CS:95.82VS:98.01

CS:61.99VS:97.70

CS:56.50VS:98.92

CS:60.21VS:98.37

hi CS:49.28VS:89.38

CS:57.87VS:85.98

CS:69.36VS:88.96

CS:54.16VS:91.56

CS:42.57VS:84.20

CS:91.89VS:95.63

CS:60.74VS:95.26

CS:51.79VS:97.12

CS:55.15VS:96.33

gu CS:62.94VS:91.34

CS:60.43VS:86.79

CS:77.34VS:92.21

CS:87.69VS:99.11

CS:52.87VS:85.80

CS:96.94VS:97.38

CS:65.50VS:97.16

CS:54.23VS:98.32

CS:58.00VS:97.82

mr CS:45.56VS:81.60

CS:58.15VS:87.84

CS:57.92VS:84.45

CS:43.55VS:78.45

CS:60.21VS:83.30

CS:62.85VS:77.33

CS:73.60VS:77.16

CS:55.13VS:76.56

CS:47.88VS:80.34

te CS:83.68VS:98.27

CS:50.52VS:80.15

CS:85.84VS:97.63

CS:88.21VS:99.12

CS:95.46VS:97.38

CS:43.72VS:77.60

CS:64.69VS:99.26

CS:84.23VS:97.29

CS:80.15VS:98.11

kn CS:82.65VS:98.11

CS:53.66VS:82.01

CS:83.26VS:97.23

CS:88.17VS:99.06

CS:94.45VS:96.85

CS:49.07VS:79.35

CS:99.05VS:99.21

CS:97.94VS:99.66

CS:90.08VS:99.54

ml CS:84.73VS:99.30

CS:48.62VS:79.19

CS:96.91VS:99.25

CS:88.16VS:99.12

CS:97.18VS:98.62

CS:43.71VS:79.11

CS:98.30VS:99.30

CS:86.23VS:99.65

CS:87.51VS:98.43

ta CS:78.75VS:95.68

CS:52.59VS:78.61

CS:82.81VS:95.82

CS:53.83VS:96.18

CS:81.81VS:96.26

CS:44.58VS:76.46

CS:81.87VS:96.61

CS:82.86VS:97.42

CS:81.77VS:97.09

Observations – N-fold evaluation

61

Vowel Segmentation (VS) system outperforms Character segmentation (CS) system for all pairs

For many pairs, accuracy difference of around 30% is observed

Baseline – Rule based system

62


pa 56.1 79.5 94.5 84.7 66.2 94 92.3 93.7 67.7

as 62.2 80.1 65.3 59.8 59.8 55.3 58.2 55.7 -

bn 84.6 81 94.1 76.7 58 94.7 90.9 96.4 71.7

hi 84.7 64.3 73.1 85.3 73.7 93.4 92.9 96.1 73.6

gu 87.5 60.3 77.4 97.9 81.4 95.8 95.1 98.4 72.9

mr 65.2 61.1 56.4 73.2 82.9 67.5 66.6 67.2 -

te 96.9 56.1 91.9 98.5 96.8 65.8 98.9 99.1 73.1

kn 97.8 56.9 90.1 98.7 95.8 67.4 98.6 99.2 72.9

ml 99.3 54 95.3 98.9 98.2 68.4 99.4 99.2 72.5

ta 78.5 - 78.5 79.1 79.6 - 81.2 81.5 77.5

Character Set difference impacts

accuracies !!!

CS versus VS – Accuracies (Ignoring OOVs)

63


pa CS:77.50VS:82.93

CS:89.80VS:93.78

CS:96.80VS:98.60

CS:90.30VS:89.48

CS:77.80VS:78.91

CS:95.70VS:98.00

CS:96.80VS:98.40

CS:96.90VS:98.59

CS:98.50VS:98.29

as CS:73.17VS:83.72

CS:82.58VS:86.89

CS:76.38VS:87.21

CS:74.45VS:85.32

CS:71.07VS:81.35

CS:71.69VS:83.75

CS:73.95VS:86.58

CS:69.35VS:82.29 -

bn CS:90.30VS:93.09

CS:78.60VS:87.70

CS:97.40VS:97.79

CS:90.40VS:93.80

CS:68.20VS:80.66

CS:96.20VS:96.99

CS:95.50VS:97.09

CS:98.40VS:98.20

CS:97.70VS:97.98

hi CS:86.40VS:87.60

CS:79.30VS:84.94

CS:79.70VS:88.29

CS:81.20VS:87.99

CS:72.77VS:82.88

CS:95.70VS:96.49

CS:93.30VS:93.59

CS:95.40VS:96.79

CS:95.60VS:95.79

gu CS:89.30VS:88.80

CS:83.00VS:87.01

CS:84.10VS:91.19

CS:98.70VS:98.99

CS:81.60VS:82.98

CS:97.00VS:97.00

CS:95.70VS:96.70

CS:98.00VS:98.39

CS:98.00VS:98.19

mr CS:78.70VS:79.88

CS:79.48VS:89.37

CS:75.40VS:84.64

CS:66.87VS:75.88

CS:77.40VS:81.25

CS:67.00VS:76.46

CS:75.05VS:79.28

CS:69.62VS:76.84 -

te CS:97.40VS:98.40

CS:75.35VS:80.76

CS:96.40VS:98.09

CS:99.20VS:99.30

CS:97.60VS:98.20

CS:70.10VS:77.23

CS:98.70VS:98.80

CS:99.00VS:97.79

CS:98.50VS:98.79

kn CS:97.60VS:98.40

CS:76.48VS:82.19

CS:94.60VS:97.40

CS:98.50VS:98.90

CS:96.20VS:96.89

CS:71.64VS:81.03

CS:99.20VS:99.60

CS:99.50VS:99.90

CS:98.90VS:99.30

ml CS:99.00VS:99.10

CS:73.45VS:80.14

CS:99.60VS:99.70

CS:99.10VS:99.30

CS:98.40VS:99.00

CS:72.09VS:80.05

CS:98.90VS:99.40

CS:99.80VS:99.90

CS:97.20VS:97.89

ta CS:84.10VS:94.25 - CS:86.20

VS:95.36CS:86.80VS:95.46

CS:86.70VS:96.68 - CS:86.50

VS:96.59CS:86.90VS:96.18

CS:85.70VS:96.17

Consistent improvement in

accuracies!!!

Side Effect (OOVs)

64


pa CS:0VS:16

CS:0VS:3

CS:0VS:2

CS:0VS:2

CS:0VS:9

CS:0VS:2

CS:0VS:3

CS:0VS:4

CS:0VS:4

as CS:1VS:17

CS:0VS:0

CS:1VS:62

CS:2VS:19

CS:1VS:51

CS:4VS:114

CS:2VS:76

CS:5VS:153

bn CS:0VS:1

CS:0VS:0

CS:0VS:5

CS:0VS:0

CS:0VS:7

CS:0VS:3

CS:0VS:4

CS:0VS:2

CS:0VS:8

hi CS:0VS:0

CS:0VS:4

CS:0VS:1

CS:0VS:1

CS:0VS:0

CS:0VS:4

CS:0VS:2

CS:0VS:4

CS:0VS:3

gu CS:0VS:0

CS:0VS:15

CS:0VS:1

CS:0VS:5

CS:0VS:1

CS:0VS:1

CS:0VS:1

CS:0VS:4

CS:0VS:3

mr CS:0VS:1

CS:1VS:50

CS:0VS:4

CS:0VS:0

CS:0VS:8

CS:0VS:74

CS:2VS:59

CS:6VS:171

te CS:0VS:2

CS:2VS:101

CS:0VS:4

CS:0VS:2

CS:0VS:2

CS:0VS:60

CS:0VS:2

CS:0VS:3

CS:0VS:9

kn CS:0VS:3

CS:1VS:79

CS:0VS:0

CS:0VS:4

CS:0VS:4

CS:2VS:67

CS:0VS:3

CS:0VS:4

CS:0VS:2

ml CS:0VS:1

CS:17VS:139

CS:0VS:6

CS:0VS:5

CS:0VS:2

CS:4VS:153

CS:0VS:4

CS:0VS:1

CS:0VS:7

ta CS:0VS:8

CS:0VS:9

CS:0VS:9

CS:0VS:6

CS:0VS:3

CS:0VS:4

CS:0VS:8

Lesser coverage!!!

OOVs Impact

65


pa CS:77.50VS:81.60

CS:89.80VS:93.50

CS:96.80VS:98.40

CS:90.30VS:89.30

CS:77.80VS:78.20

CS:95.70VS:97.80

CS:96.80VS:98.10

CS:96.90VS:98.20

CS:98.50VS:97.90

as CS:73.10VS:82.30

CS:82.58VS:86.89

CS:76.30VS:81.80

CS:74.30VS:83.70

CS:71.00VS:77.20

CS:71.40VS:74.20

CS:73.80VS:80.00

CS:69.00VS:69.70 -

bn CS:90.30VS:93.00

CS:78.60VS:87.70

CS:97.40VS:97.30

CS:90.40VS:93.80

CS:68.20VS:80.10

CS:96.20VS:96.70

CS:95.50VS:96.70

CS:98.40VS:98.00

CS:97.70VS:97.20

hi CS:86.40VS:87.60

CS:79.30VS:84.60

CS:79.70VS:88.20

CS:81.20VS:87.90

CS:72.77VS:82.88

CS:95.70VS:96.10

CS:93.30VS:93.40

CS:95.40VS:96.40

CS:95.60VS:95.50

gu CS:89.30VS:88.80

CS:83.00VS:85.70

CS:84.10VS:91.10

CS:98.70VS:98.50

CS:81.60VS:82.90

CS:97.00VS:96.90

CS:95.70VS:96.60

CS:98.00VS:98.00

CS:98.00VS:97.90

mr CS:78.70VS:79.80

CS:79.40VS:84.90

CS:75.40VS:84.30

CS:66.87VS:75.88

CS:77.40VS:80.60

CS:67.00VS:70.80

CS:74.90VS:74.60

CS:69.20VS:63.70 -

te CS:97.40VS:98.20

CS:75.20VS:72.60

CS:96.40VS:97.70

CS:99.20VS:99.10

CS:97.60VS:98.00

CS:70.10VS:72.60

CS:98.70VS:98.60

CS:99.00VS:97.50

CS:98.50VS:97.90

kn CS:97.60VS:98.10

CS:76.40VS:75.70

CS:94.60VS:97.40

CS:98.50VS:98.50

CS:96.20VS:96.50

CS:71.50VS:75.60

CS:99.20VS:99.30

CS:99.50VS:99.50

CS:98.90VS:99.10

ml CS:99.00VS:99.00

CS:72.20VS:69.00

CS:99.60VS:99.10

CS:99.10VS:98.80

CS:98.40VS:98.80

CS:71.80VS:67.80

CS:98.90VS:99.00

CS:99.80VS:99.80

CS:97.20VS:97.20

ta CS:84.10VS:93.50 - CS:86.20

VS:94.50CS:86.80VS:94.60

CS:86.70VS:96.10 - CS:86.50

VS:96.30CS:86.90VS:95.80

CS:85.70VS:95.40

Correctness versus coverage

66

Vowel segmentation increases correctness with reduced coverage

Character segmentation has high coverage with low accuracy

Ideal transliteration system should have high coverage and high accuracy

Hybrid System

67

Use Character segmentation system for OOVs

Only segments not transliterated by VS system are transliterated using CS system

Hybrid System

68


pa CS:77.50VS:82.50

CS:89.80VS:93.70

CS:96.80VS:98.60

CS:90.30VS:89.50

CS:77.80VS:78.90

CS:95.70VS:97.90

CS:96.80VS:98.40

CS:96.90VS:98.50

CS:98.50VS:98.30

as CS:73.10VS:83.10

CS:82.58VS:86.89

CS:76.30VS:85.90

CS:74.30VS:84.80

CS:71.00VS:80.60

CS:71.40VS:81.70

CS:73.80VS:85.20

CS:69.00VS:78.40 -

bn CS:90.30VS:93.10

CS:78.60VS:87.70

CS:97.40VS:97.80

CS:90.40VS:93.80

CS:68.20VS:80.60

CS:96.20VS:96.90

CS:95.50VS:97.00

CS:98.40VS:98.20

CS:97.70VS:98.00

hi CS:86.40VS:87.60

CS:79.30VS:84.80

CS:79.70VS:88.30

CS:81.20VS:88.00

CS:72.77VS:82.88

CS:95.70VS:96.50

CS:93.30VS:93.60

CS:95.40VS:96.70

CS:95.60VS:95.80

gu CS:89.30VS:88.80

CS:83.00VS:87.00

CS:84.10VS:91.20

CS:98.70VS:99.00

CS:81.60VS:83.00

CS:97.00VS:97.00

CS:95.70VS:96.70

CS:98.00VS:98.40

CS:98.00VS:98.20

mr CS:78.70VS:79.90

CS:79.40VS:88.60

CS:75.40VS:84.40

CS:66.87VS:75.88

CS:77.40VS:81.40

CS:67.00VS:74.60

CS:74.90VS:78.60

CS:69.20VS:73.90 -

te CS:97.40VS:98.40

CS:75.20VS:79.80

CS:96.40VS:98.10

CS:99.20VS:99.30

CS:97.60VS:98.20

CS:70.10VS:76.90

CS:98.70VS:98.80

CS:99.00VS:97.70

CS:98.50VS:98.80

kn CS:97.60VS:98.40

CS:76.40VS:81.30

CS:94.60VS:97.40

CS:98.50VS:98.90

CS:96.20VS:96.80

CS:71.50VS:79.60

CS:99.20VS:99.60

CS:99.50VS:99.90

CS:98.90VS:99.30

ml CS:99.00VS:99.10

CS:72.20VS:77.70

CS:99.60VS:99.60

CS:99.10VS:99.30

CS:98.40VS:99.00

CS:71.80VS:77.70

CS:98.90VS:99.40

CS:99.80VS:99.90

CS:97.20VS:97.90

ta CS:84.10VS:94.30 - CS:86.20

VS:95.30CS:86.80VS:95.50

CS:86.70VS:96.60 - CS:86.50

VS:96.60CS:86.90VS:96.20

CS:85.70VS:95.90

Summary

69

Ta - Hi Accuracy F-Score

Rule Based 81.7 0.85

CS vs. VS CS:86.80VS:95.46

CS:0.975VS:0.989

OOVs impact CS:86.80VS:94.60

CS:0.975VS:0.987

Hybrid System CS:86.80VS:95.50

CS:0.975VS:0.989

Error analysis

70

Modified Levenshtein distance algorithm to compute the confusion matrix

Errors in CS based system persists in VS based system but to lesser extent

Common errors include confusion between Matras Voiced / unvoiced consonants Aspirated / Unaspirated consonants

Top 10 Errors Tamil - Hindi

71

Expected Output Count

1 थ(tha) त(ta) 52

2 त(ta) थ(tha) 22

3 ख(kha) क(ka) 19

4 फ(fa) प(pa) 14

5 झ(jha) ज(ja) 13

6 ◌्(halfconsonantmodifier) NULL 4

7 क(ka) ख(kha) 2

8 छ(chha) च(cha) 2

9 न(na) ◌ं('ma'or'na') 2

10 न(na) ण(na) 2

CS System

Expected Output Count

1 थ(tha) त(ta) 16

2 झ(jha) ज(ja) 7

3 ख(kha) क(ka) 5

4 ◌्(halfconsonantmodifier) NULL 4

5 त(ta) थ(tha) 3

6 फ(fa) प(pa) 3

7 न(na) ◌ं('ma'or'na') 2

8 न(na) ण(na) 2

9 म(ma) ◌ं('ma'or'na') 2

10 ◌ी(ee) ि◌(i) 2

VS System

How much is sufficient for Training?

72

Increase in training data reduces number of OOVs

Threshold on number of OOVs help determine lower limit for training data size

We considered OOVs below 50 as threshold

Tamil – Hindi stepwise accuracies

73

Step Wise results – Tamil as source

74

Tamil has higher coverage for lesser training data

Malayalam needs more training datafor coverage of all segments

Indian language to English transliteration

75

Challenges – Origin plays an important role

76

Many Indian language words are borrowed from Sanskrit called tatsama words.

For example, दुग (durg, fort) is retained in Hindi, Marathi and Kannada

दुगgets transliterated to durg in case of ‘Sindhudurg’ and durga in case of ‘Chitradurga’

Sindhudurg has influence of Marathi while Chitradurga has influence of Kannada

Origin of the place influences how it is written in English

Challenges- One to many mapping

77

Single character in Indian language represented by multiple characters in English

For example, ‘ख’ in Hindi is represented by ‘kh’ in English

Same character in Indian language may be represented by multiple English segments

For example, ‘◌ी’ in Hindi can be represented as ‘i’ as

well as ‘ee’

Results

78

N-fold evaluation performed on 10 Indian languages to English

as bn gu hi kn ml mr pa ta te

CS:04.62

VS:50.68

CS:05.66

VS:37.75

CS:12.45

VS:48.87

CS:11.71

VS:57.21

CS:12.31

VS:48.94

CS:12.72

VS:48.49

CS:08.24

VS:45.77

CS:06.39

VS:39.28

CS:07.74

VS:35.39

CS:10.25

VS:46.99

Error analysis

79

Schwa deletion and insertion is major cause of errors in transliteration

CS system: Character deletion, mainly at the end of the

word Schwa deletion

VS system Schwa deletion and insertion Confusion in matras

Top Ten Errors – Hindi to English

80

Expecte

dOutput Count

a NULL 33754

h NULL 15239

n NULL 12165

y NULL 5498

r NULL 4009

t NULL 3188

e NULL 2377

k NULL 1608

NULL a 1556

l NULL 1505

Expecte

dOutput Count

a NULL 9177

NULL a 3651

NULL i 2784

n NULL 2657

NULL h 1095

e i 1081

e NULL 869

NULL u 730

NULL e 717

i e 502

CS System VS System

Impact on Sandhan: Marathi CLIR

81

Query translation evaluation

82

Char based Transliteration

Vowel segmentation

Google translate

Char based Transliteration

Vowel segmentation

Google translate

Not stemmed Stemmed

0.44375 0.5375 0.6375 0.475 0.55625 0.65

Total Marathi-English translations evaluated = 80

Quality Score

Good 1

Medium 0.5

Bad 0

Examples

83

Marathi query Char based system VS system Google

translate

सागरे वर sagareshawar sagareshwar sagares on

ठोसेघर धबधबा thoseghrwaterfall

thosegharwaterfall

thosegharachute

उरमोडी नद urmodi river uramody river uramodi river

संत ाने वर saint jayaneshawar

saint jjnaneshwar sant gyaneshwar

Error analysis of translations

84

Incorrect entries in translation dictionary Example: district misspelt as distt

Most frequent used translation word not picked Example: क ला (killa,fort) gets translated

to stronghold instead of fort.

Stemming error Example: लेणी (leni, cave) gets stemmed to लेणे (lene)

cs626 : natural language processing/speech, nlp and …pb/cs626-2014/cs626-lect34-36... · cs626 :...

Documents