diac+: a professional diacritics recovering system

DIAC+: A Professional Diacritics Recovering System

Research Institute for Artificial Intelligence, Romanian Academy

13, Calea "13 Septembrie", 050711, Bucharest

Dan Tufiş, Alexandru Ceauşu

Outlook

I. Motivations
II. Related work and different approaches
III. Diacritics in Romanian
IV. DIAC+ Architecture
V. Evaluation
VI. Implementation

Motivations

• Almost all European languages use diacritics
• In most languages that use diacritical characters, they are usually not only decorative, but they may have grammatical and/or semantic meaning
• The lack or the wrong use of diacritics is extremely annoying, especially in texts meant for publication.
• Why are diacritics still missing nowadays?
– reuse of older texts
– ergonomic factors (non-localized keyboards, multiple keystrokes for a diacritical character)
– inappropriate authoring tools or character-set converters
– typos

Different approaches & related work

• Word-based (dictionary supported) approaches:
– El-Bèze et al. (1994), Yarowsky (1994), Spriet & El-Bèze (1997), Simard (1998), Tufiş & Chiţu (1999) etc.
• Character-based approaches:
– Mihalcea (2002), Bobiceva (2008), Zweigenbaum and Grabar (2002), Wagacha et al. (2006), De Pauw et al. (2007) etc.

Diacritics in Romanian (I)

• Romanian has 5 diacritical characters: ă, â, î, ş and ţ (plus their uppercase variants)
• Two categories of words may contain diacritics:
– U-words (Unambiguous words): the class of legal words of Romanian which, when their diacritics are stripped off, are no longer words of the language:
• padure (pădure - forest), tufis (tufiş - bush), cantar (cântar - balance), carare (cărare - pathway), casmir (caşmir - cashmere), macar (măcar - at least), fara (fără - without), cati (câţi - how many)
Their recovery is trivial when a back-up lexicon is available

Diacritics in Romanian (II)

– A-words (Ambiguous words): the class of legal words of Romanian which, when their diacritics are stripped off, are still words of the language; these words are never identified by a traditional spell-checker. For instance, the string fata could mean any of the following:
• fata – the girl
• fată – a girl; or (about animals) gives birth
• fâţa – the quick-swimming little fish/the coquette
• fâţă – a quick-swimming little fish/a coquette
• faţa – the face
• faţă – a face
• făta – (about animals) to give birth; gave birth
• fătă – (about animals) just gave birth
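The U-word/A-word distinction can be illustrated with a minimal Python sketch. It assumes a tiny hand-made lexicon (LEXICON below is illustrative, not the system's LEX), and the function names are hypothetical. Diacritics are stripped through Unicode decomposition, which is one possible way of doing it, not necessarily the one used in DIAC+.

```python
import unicodedata
from collections import defaultdict

def strip_diacritics(word: str) -> str:
    """Remove combining marks after canonical decomposition (e.g. 'ă' becomes 'a')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def classify(word: str, lexicon: set) -> str:
    """Label a lexicon word as U-word, A-word, or as carrying no diacritics."""
    stripped = strip_diacritics(word)
    if stripped == word:
        return "no-diacritics"
    # A-word: the stripped form is itself a legal word; U-word: it is not.
    return "A-word" if stripped in lexicon else "U-word"

def build_recovery_index(lexicon: set) -> dict:
    """Map each stripped form to every lexicon word that strips to it."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_diacritics(word)].add(word)
    return index

# Illustrative toy lexicon (an assumption, far smaller than a real one).
LEXICON = {"pădure", "tufiş", "fata", "fată", "faţa", "faţă"}
INDEX = build_recovery_index(LEXICON)

if __name__ == "__main__":
    for w in sorted(LEXICON):
        print(w, classify(w, LEXICON))
    print(sorted(INDEX["fata"]))
```

With such an index, a stripped U-word maps to a single entry, while a stripped A-word maps to several candidates that need further disambiguation.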

Diacritics in Romanian (III)

• Most A-words can be disambiguated based on grammatical information; those that cannot are called S-words (Semantically ambiguous words).
• The proper treatment of S-words (characterized by the same morpho-syntactic properties) requires semantic disambiguation.
– For the previous example, knowing the morpho-syntactic properties (Ncfsry: common nouns, feminine, definite forms and direct case) still leaves three diacritics restoration possibilities with very different meanings:
• fata (Ncfsry) – the girl
• faţa (Ncfsry) – the face
• fâţa (Ncfsry) – the quick-swimming little fish/the coquette
• A text may:
– be completely diacritics-free (Tufiş and Chiţu 1999), or
– partially contain diacritics (and not always correctly); this is a harder case

Diacritics in Romanian (IV)

Corpus                              Journalism (Agenda)    Juridical (Acquis)
1.  No. of words                    6,680,448              3,511,093
1*. No. of chars                    37,008,236             21,404,666
2.  No. of words with diacs (of 1)  2,004,763 (30.01%)     1,026,385 (29.23%)
2*. No. of diacs                    2,351,220              1,192,875
3.  U-words (of 2)                  238,132 (11.88%)       175,822 (17.13%)
4.  A-words (of 2)                  1,766,631 (88.12%)     850,563 (82.87%)
5.  S-words (Ctag-set, of 4)        58,420 (3.31%)         38,323 (4.51%)
6.  S-words (MSDtag-set, of 4)      24,916 (1.41%)         16,463 (1.94%)

In an ideal setting, with a full-coverage dictionary and a text with no typographical errors other than the missing diacritics, about 25% (#A-words/#Words) of the total number of words would remain ambiguous.

Our supposedly error-free texts contained 72,722 (1.09%) typing errors (journalism texts) and 29,387 (0.84%) typing errors (juridical texts).
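The "about 25%" figure can be checked from the word and A-word counts reported for the two corpora. The short sketch below only redoes that division; the dictionary layout and function name are illustrative.

```python
# Counts taken from the corpus statistics (number of words, number of A-words).
CORPORA = {
    "Journalism (Agenda)": {"words": 6_680_448, "a_words": 1_766_631},
    "Juridical (Acquis)": {"words": 3_511_093, "a_words": 850_563},
}

def ambiguous_share(stats: dict) -> float:
    """Fraction of all words that are A-words (#A-words / #Words)."""
    return stats["a_words"] / stats["words"]

if __name__ == "__main__":
    for name, stats in CORPORA.items():
        print(f"{name}: {ambiguous_share(stats):.1%} of words are A-words")
```

Both shares fall within about one and a half percentage points of 25%, one slightly above and one slightly below.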

DIAC+ Architecture

[Architecture diagram] Input text → (i) Tokenization → (ii) Hypotheses generation → (iii) Tiered tagging → (iv) Candidate selection → (v) Unknown words processing → Output text & spelling alternatives
Resources: Tokenizer resources; D0, D1, D2 dictionaries; Language model; Character model

Dictionaries D0, D1, D2 and Hypotheses Generation

• LEX dictionary – normative lexicon with entries of the form <wordform> <tag> <lemma>; 1 million entries
• D0 dictionary is the subset of LEX containing all the words with at least one diacritical character
• D1 dictionary is the diacritics stripped-off version of LEX
• D2 dictionary contains words in the current text which are neither in D0 nor in D1 and which are suspected of being typing errors; they are derived from the words in D0 ∪ D1 differing by plus or minus one character or by switching two consecutive characters (additionally, the switched characters should be neighbors on the keyboard)
• In the hypotheses generation step, a word is first searched in D0 ∪ D1
• If the word cannot be found in D0 ∪ D1, it is searched in the D2 dictionary. A word which is not found in any of the system's lexicons is considered unknown and irrecoverable by the word-based approach, and its processing is left to a character-based recovery module.
• A word W, occurring in the current text, may be associated with several entries <surface-form_k, MSD_k> in the LEX word-form lexicon; the tagging step is used to filter this set and eventually select the single contextually correct <surface-form_i>.
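The lookup cascade described above (D0 and D1 first, then D2, otherwise unknown) might look roughly like the following Python sketch. It makes several simplifying assumptions: D0 and D1 are collapsed into one index keyed by the stripped form, D2 candidates are computed on the fly instead of being stored, and the keyboard-neighbour condition on swapped characters is omitted. All names are hypothetical.

```python
import unicodedata
from collections import defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def strip_diacritics(word: str) -> str:
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(lexicon):
    """Stripped form -> set of lexicon surface forms (stands in for D0 and D1)."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_diacritics(word)].add(word)
    return dict(index)

def typo_neighbours(token: str, known: set) -> set:
    """Known stripped forms one insertion, deletion or adjacent swap away."""
    variants = set()
    for i in range(len(token)):                      # token has one extra character
        variants.add(token[:i] + token[i + 1:])
    for i in range(len(token) + 1):                  # token lacks one character
        for ch in ALPHABET:
            variants.add(token[:i] + ch + token[i:])
    for i in range(len(token) - 1):                  # two consecutive characters swapped
        variants.add(token[:i] + token[i + 1] + token[i] + token[i + 2:])
    return variants & known

def generate_hypotheses(token: str, index: dict):
    """Return (source, candidates): 'D0/D1', 'D2', or 'unknown' with no candidates."""
    key = strip_diacritics(token)
    if key in index:
        return "D0/D1", sorted(index[key])
    near = typo_neighbours(key, set(index))
    if near:
        return "D2", sorted(form for k in near for form in index[k])
    return "unknown", []

# Illustrative toy lexicon (an assumption).
INDEX = build_index({"pădure", "fata", "fată", "faţa"})

if __name__ == "__main__":
    for tok in ["padure", "fata", "padrue", "xyzzy"]:
        print(tok, generate_hypotheses(tok, INDEX))
```

Tokens that end up as "unknown" are the ones handed over to the character-based module.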

Tiered Tagging & Candidate Selection

• Tagging uses a special HMM language model in which the transition probabilities were computed from the regular training corpora (i.e. with diacritics) and the emission probabilities were computed from the diacritics stripped-off training corpora.
• TT = a two-step tagging process
– Tagging with a reduced tagset LM (92 tags)
– Recovering left-out information from the lexical tagset (615 tags)

• Candidate selection. The U-words are replaced with their diacritical counterpart. The A-words which are not S-words are replaced by the surface-form identified by the MSD assigned by the tagger to the respective A-word. For the S-words, either the user is presented with a list of contextually meaningful choices or the replacement is automatically done based on lexical probabilities or some probabilistic preferences.
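A minimal sketch of the candidate selection step, under stated assumptions: each hypothesis is a (surface form, MSD, lexical probability) triple, the probabilities below are invented for illustration, and the tag "Ncfsrn" is an illustrative placeholder for a non-definite reading. Only "Ncfsry" appears in the slides. The function name is hypothetical.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, str, float]  # (surface form, MSD, lexical probability)

def select_candidate(candidates: List[Candidate],
                     assigned_msd: str) -> Tuple[Optional[str], List[str]]:
    """Return (chosen form, all forms compatible with the tagger's MSD).

    One compatible form: plain A-word, the MSD decides.
    Several compatible forms: S-word, fall back on lexical probability
    (a supervised mode could show the list of choices to the user instead).
    """
    matching = [c for c in candidates if c[1] == assigned_msd]
    if not matching:
        return None, []
    forms = sorted({form for form, _, _ in matching})
    if len(forms) == 1:
        return forms[0], forms
    best_form = max(matching, key=lambda c: c[2])[0]
    return best_form, forms

# Hypotheses for the stripped string "fata"; probabilities are made up.
CANDIDATES: List[Candidate] = [
    ("fata", "Ncfsry", 0.60),
    ("faţa", "Ncfsry", 0.35),
    ("fâţa", "Ncfsry", 0.05),
    ("fată", "Ncfsrn", 1.00),   # placeholder tag, see lead-in
]

if __name__ == "__main__":
    print(select_candidate(CANDIDATES, "Ncfsry"))
    print(select_candidate(CANDIDATES, "Ncfsrn"))
```

When the second element of the result holds more than one form, the token is an S-word and the returned choice is only a probabilistic preference.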

Character Model and Unknown Words Processing (I)

• Unknown word processing is used as backup for the candidate selection stage where no equivalent word-form was found in the lexicon. This case is quite rare – very few words are not covered by our almost 1,000,000 entries lexicon. The unknown word processing can be designed to work in parallel with the candidate selection phase. For processing unknown words, we used a character-based N-gram model similar to the one used in (Mihalcea, 2002).

• We used SRILM - SRI Language Modeling Toolkit (Stolcke, 2002) to train several character models. The training corpus contained 5,124,277 characters (including spaces) in 48,308 sentences, and the test corpus contained 613,234 characters in 6,411 sentences.

Character Model and Unknown Words Processing (II)

Model order   Perplexity   Accuracy (no spaces)   Model size
2-gram        12.42        93.67%                 20.8 KB
3-gram        9.72         95.52%                 223 KB
4-gram        7.11         97.72%                 1.29 MB
5-gram        5.77         98.59%                 4.82 MB
6-gram        5.29         98.79%                 13.1 MB
7-gram        5.17         98.84%                 27.7 MB
8-gram        5.18         98.85%                 48.4 MB

We used Viterbi estimation with a 5-gram character model to find the most probable string for the unknown word.
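To illustrate the idea of searching for the most probable diacritical string, here is a deliberately reduced Python sketch. It is not the SRILM 5-gram setup: it uses a character bigram model with add-one smoothing trained on a handful of words, handles lowercase only, and all names (VARIANTS, train_bigram, restore) are hypothetical.

```python
import math
from collections import Counter

# Each plain letter and the forms it may be restored to (lowercase only).
VARIANTS = {"a": "aăâ", "i": "iî", "s": "sş", "t": "tţ"}

def train_bigram(words):
    """Return logp(prev, cur) from bigram counts with add-one smoothing."""
    pair_counts, context_counts = Counter(), Counter()
    for word in words:
        padded = "^" + word + "$"          # '^' = word start, '$' = word end
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[prev, cur] += 1
            context_counts[prev] += 1
    vocab_size = len({ch for w in words for ch in w} | {"$"})

    def logp(prev: str, cur: str) -> float:
        return math.log((pair_counts[prev, cur] + 1)
                        / (context_counts[prev] + vocab_size))
    return logp

def restore(word: str, logp) -> str:
    """Viterbi search: the state is the last restored character."""
    best = {"^": (0.0, "")}                # state -> (log-probability, text so far)
    for ch in word:
        step = {}
        for cand in VARIANTS.get(ch, ch):
            step[cand] = max(
                ((score + logp(prev, cand), text + cand)
                 for prev, (score, text) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    final = max((score + logp(prev, "$"), text)
                for prev, (score, text) in best.items())
    return final[1]

# Tiny illustrative training set (an assumption, not the training corpus).
LOGP = train_bigram(["pădure", "pădurar", "măr", "tufiş", "fără"])

if __name__ == "__main__":
    print(restore("padure", LOGP))
```

A higher-order model would carry the last n-1 characters as the state, and the search itself stays the same.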

Evaluation (I)

• Word-based vs. character-based evaluations
• The evaluation scenario
– R = reference corpus, tokenized, tagged and lemmatized; hand validated (approx. 118,000 words and about 502,000 characters).
– TT = the diacritics stripped-off version of R
– RT = the tag and lemma stripped-off version of TT
– Baseline system: from the Agenda Corpus (10 million words) we derived a dictionary in which the head entries are non-diacritical forms of words and the body of each entry is the list of diacritical counterparts, each with its frequency in the corpus; the baseline system replaces a head word from this lexicon with its most frequent diacritical counterpart
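The baseline system can be sketched in a few lines of Python. The function names and the four-token corpus are illustrative assumptions. The sketch also counts the diacritics-free spelling itself as a possible counterpart, which the slide does not state explicitly.

```python
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(word: str) -> str:
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_baseline(corpus_tokens):
    """Head entry (stripped form) -> most frequent counterpart seen in the corpus."""
    freq = defaultdict(Counter)
    for token in corpus_tokens:
        freq[strip_diacritics(token)][token] += 1
    return {head: forms.most_common(1)[0][0] for head, forms in freq.items()}

def apply_baseline(tokens, table):
    """Replace every token with the top counterpart; unseen tokens stay unchanged."""
    return [table.get(strip_diacritics(tok), tok) for tok in tokens]

# Toy training corpus (an assumption).
TABLE = build_baseline(["faţă", "faţă", "fata", "fără"])

if __name__ == "__main__":
    print(apply_baseline(["fata", "fara", "casa"], TABLE))
```

Because the choice ignores context, every occurrence of an ambiguous string receives the same restoration, which is where the baseline loses accuracy against tagging-based selection.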

Word-based Evaluation

                        DIAC+, tagged text (TT)   DIAC+, raw text (RT)   Baseline system
Tokens                  117,909
Words with diacritics   34,745 (29.47%)
S-words                 361
Unknown words           2,130 (1.8%)
Correct words           116,810 (99.06%)          115,262 (97.75%)       113,491 (96.25%)
Incorrect words         1,092 (0.94%)             2,609 (2.25%)          4,418 (3.75%)

Character-based Evaluation

                                   DIAC+, tagged text (TT)   DIAC+, raw text (RT)   Baseline system
Characters (no spaces)             501,735
Diacritical characters             41,144 (8.2%)
Correct characters (no spaces)     500,400 (99.73%)          498,764 (99.4%)        497,096 (99.07%)
Incorrect characters (no spaces)   1,335 (0.27%)             2,971 (0.6%)           4,639 (0.93%)

Evaluations in terms of characters always look much better (approx. 4 times better) than evaluations in terms of words!
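The "approx. 4 times" remark can be reproduced by dividing the reported word-level error rates by the character-level ones. The sketch below uses the percentages from the two evaluation tables; the variable and function names are illustrative.

```python
# Reported error rates in percent, per system, from the two evaluation tables.
WORD_ERROR = {"tagged text (TT)": 0.94, "raw text (RT)": 2.25, "baseline": 3.75}
CHAR_ERROR = {"tagged text (TT)": 0.27, "raw text (RT)": 0.60, "baseline": 0.93}

def error_ratios() -> dict:
    """Word-level error rate divided by character-level error rate."""
    return {name: WORD_ERROR[name] / CHAR_ERROR[name] for name in WORD_ERROR}

if __name__ == "__main__":
    for name, ratio in error_ratios().items():
        print(f"{name}: word error is {ratio:.1f}x the character error")
```

The ratio is roughly 3.5 to 4 for all three systems, so the two measures should not be compared directly.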

Implementation

• Two versions:
– Standalone (everything packed in one executable; rather slow for large MS Office documents)
– Web service (distributed among various programs and machines; much faster)
• In both versions DIAC+ may work under user supervision (like classical spell-checkers) or independently
– generates a logfile documenting each correction (initial word-form, possible replacements and the actual one). Optionally, the logfile can include, for each replacement, the sentence in which it was made.
• The system can correct a few typographical errors such as transposed characters, wrongly typed characters, or omitted characters.

• The MS spell-checker underlines all the unknown words, thus allowing the user to further inspect spelling errors which are out of reach for DIAC+.
