Short Text Language Detection with Infinity-Gram 2012/05/14 NAIST Seminar Nakatani Shuyo @ Cybozu Labs Inc

Upload: shuyo-nakatani

Post on 01-Nov-2014

Short Text Language Detection with Infinity-Gram

2012/05/14 NAIST Seminar

Nakatani Shuyo @ Cybozu Labs Inc.

Agenda

• Language Detection
• Proposed Method
  – Maximal Substring
• Corpus
• Implementation and Estimations
• Conclusions


Language Detection


In What Language?

• Ik kan er nooit tegen als mensen me negeren
• Aha ich seh angeblich süß aus
• Czy mógłbym zasnąć w przedmieściach Twoich myśli
• Ah Tak Så skal jeg bare finde ud af hvordan
• Det er ikke så digg nei å vi som har finale til helga Skrekk og gru Takk :)
• tack kompis Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen
• Çok doğru En büyük hatayı yaptım
• Încântat de cunoștință
• Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung


Hints

• Dutch if there is "ik"
• German if there is "ich" or the letter ß
• Polish if there is "czy" or the letters Ł, ń, ś or ź
• Scandinavian if there is the letter å
  – Danish if there is "af" ("Tak" means "thanks")
  – Norwegian if there is "nei" ("Takk" means "thanks")
  – Swedish if there is "och" ("Tack" means "thanks")
• Turkish if there is the letter ı (i without a dot) or ğ
• Romanian if there is the letter ă, ș or ț
  – Although ă is also used in Vietnamese, it is easy to distinguish them
  – Although ş is also used in Turkish, it is easy to distinguish them
• Vietnamese if there are many unreadable letters on WinXP :P


In What Language (Solution)

• Ik kan er nooit tegen als mensen me negeren → Dutch
• Aha ich seh angeblich süß aus → German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli → Polish
• Ah Tak Så skal jeg bare finde ud af hvordan → Danish
• Det er ikke så digg nei å vi som har finale til helga Skrekk og gru Takk :) → Norwegian
• tack kompis Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen → Swedish
• Çok doğru En büyük hatayı yaptım → Turkish
• Încântat de cunoștință → Romanian
• Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung → Vietnamese


What's Language Detection?

• To detect which language the input text is written in
  – "Time flies like an arrow" → English
  – "Buona sera" → Italian
• It is a prerequisite for many language processing tasks
  – A language model is built for each language
  – Text search, classification, extraction, translation
• For sufficiently long and noiseless text, detection with more than 99% accuracy is possible [Cavnar+ 94]
  – A 3-gram model is used in many methods


SPAM or not?

• To decide, it is necessary to know that the text is written in Polish


Document Categorization with Naive Bayes Classifier

• Categorize a document $X = (X_i)$ into a category $C_k$
  – A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Word probabilities are assumed to be conditionally independent given the category
  – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
  – where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$
• Estimate the category $C_k$ that maximizes the posterior
  – $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
  – where $p(C_k)$ is the prior for the category


Language Detection with Naive Bayes Classifier

• Document categorization with language labels
  – Categorize documents into English, Japanese, and so on
• Use character n-grams as features
  – Unicode code point n-grams, strictly speaking
  – Assume the character encoding of the document is already known
    • Most applications know the encoding of their internal text data
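The combination described on the last two slides (Naive Bayes over character n-grams) can be sketched in a few lines of Python. This is only an illustrative toy, not langdetect's implementation: the class name `NgramNaiveBayes`, the additive smoothing constant `alpha`, the space padding, and the two training sentences are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return character n-grams of text, padded with a space on both sides."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Toy Naive Bayes language classifier over character n-grams."""

    def __init__(self, n=3, alpha=0.5):
        self.n = n
        self.alpha = alpha                  # additive smoothing (assumed value)
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter()               # label -> number of documents
        self.vocab = set()

    def fit(self, samples):
        for label, text in samples:
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.docs[label] += 1
            self.vocab.update(grams)

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in self.docs:
            total = sum(self.counts[label].values())
            # log p(C_k) + sum_i log p(X_i | C_k)
            score = math.log(self.docs[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                p = (self.counts[label][gram] + self.alpha) / (
                    total + self.alpha * len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NgramNaiveBayes()
model.fit([
    ("en", "this is the thing that they think"),
    ("de", "ich bin sicher dass ich das nicht sehe"),
])
print(model.predict("the thing"), model.predict("ich sehe das"))
```

Working in log space avoids underflow when many small probabilities are multiplied, which matters even for short texts once the n-gram vocabulary grows.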


Why Use n-Grams to Detect Language?

• Each language has its own characters and spelling rules
  – “é” is often used in Spanish, Italian and so on, but in principle not in English
  – There are many words which start with “Z” in German, but not in English
  – There are many words which start with “C” in English, but not in German
  – The spelling “Th” is often used in English, but not in the other languages


This
  T, h, i, s ← 1-gram
  _T, Th, hi, is, s_ ← 2-gram
  _Th, Thi, his, is_ ← 3-gram
  (_ marks a word boundary)
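The decomposition of "This" into 1-, 2- and 3-grams can be reproduced with a small helper. This is a minimal sketch: the function name `char_ngrams` and the use of `_` as a visible word-boundary marker are assumptions made for the illustration.

```python
def char_ngrams(word, n, boundary="_"):
    """Split a word into character n-grams.

    For n > 1 the word is padded with a boundary marker on both sides,
    so that word-initial and word-final letters become distinct features.
    """
    padded = word if n == 1 else boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```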

          C     L     Z     Th
English   0.75  0.47  0.02  0.74
German    0.10  0.37  0.53  0.03
French    0.38  0.69  0.01  0.01

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr


Estimation with News Text

• Tested on news text crawled from the web in 49 languages

language                   size  correct  (accuracy)
af     Afrikaans            200      199  (99.50%)
ar     Arabic               200      200  (100.00%)
bg     Bulgarian            200      200  (100.00%)
bn     Bengali              200      200  (100.00%)
cs     Czech                200      200  (100.00%)
da     Danish               200      179  (89.50%)
de     German               200      200  (100.00%)
el     Greek                200      200  (100.00%)
en     English              200      200  (100.00%)
es     Spanish              200      200  (100.00%)
fa     Persian              200      200  (100.00%)
fi     Finnish              200      200  (100.00%)
fr     French               200      200  (100.00%)
gu     Gujarati             200      200  (100.00%)
he     Hebrew               200      200  (100.00%)
hi     Hindi                200      200  (100.00%)
hr     Croatian             200      200  (100.00%)
hu     Hungarian            200      200  (100.00%)
id     Indonesian           200      200  (100.00%)
it     Italian              200      200  (100.00%)
ja     Japanese             200      200  (100.00%)
kn     Kannada              200      200  (100.00%)
ko     Korean               200      200  (100.00%)
mk     Macedonian           200      200  (100.00%)
ml     Malayalam            200      200  (100.00%)
mr     Marathi              200      200  (100.00%)
ne     Nepali               200      200  (100.00%)
nl     Dutch                200      200  (100.00%)
no     Norwegian            200      199  (99.50%)
pa     Punjabi              200      200  (100.00%)
pl     Polish               200      200  (100.00%)
pt     Portuguese           200      200  (100.00%)
ro     Romanian             200      200  (100.00%)
ru     Russian              200      200  (100.00%)
sk     Slovak               200      200  (100.00%)
so     Somali               200      200  (100.00%)
sq     Albanian             200      200  (100.00%)
sv     Swedish              200      200  (100.00%)
sw     Swahili              200      200  (100.00%)
ta     Tamil                200      200  (100.00%)
te     Telugu               200      200  (100.00%)
th     Thai                 200      200  (100.00%)
tl     Tagalog              200      200  (100.00%)
tr     Turkish              200      200  (100.00%)
uk     Ukrainian            200      200  (100.00%)
ur     Urdu                 200      200  (100.00%)
vi     Vietnamese           200      200  (100.00%)
zh-cn  Simplified Chinese   200      200  (100.00%)
zh-tw  Traditional Chinese  200      200  (100.00%)
total                      9800     9777  (99.77%)

Estimation with Europarl datasets


• Tested on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
    • http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language         size   correct  accuracy
bg  Bulgarian    1000       988     98.8%
cs  Czech        1000       994     99.4%
da  Danish       1000       968     96.8%
de  German       1000       998     99.8%
el  Greek        1000      1000    100.0%
en  English      1000       996     99.6%
es  Spanish      1000       996     99.6%
et  Estonian     1000       996     99.6%
fi  Finnish      1000       998     99.8%
fr  French       1000       999     99.9%
hu  Hungarian    1000       999     99.9%
it  Italian      1000       999     99.9%
lt  Lithuanian   1000       997     99.7%
lv  Latvian      1000       999     99.9%
nl  Dutch        1000       974     97.4%
pl  Polish       1000       999     99.9%
pt  Portuguese   1000       996     99.6%
ro  Romanian     1000       999     99.9%
sk  Slovak       1000       988     98.8%
sl  Slovene      1000       976     97.6%
sv  Swedish      1000       991     99.1%
total           21000     20850     99.3%

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat


Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports


language         LD     CLD    Tika
ca  Catalan      95.3   93.0   83.8
cs  Czech        96.3   96.6   ----
da  Danish       94.5   90.7   58.7
de  German       86.6   96.8   73.1
en  English      88.3   97.4   54.7
es  Spanish      91.5   90.5   44.4
fi  Finnish      98.9   99.4   94.8
fr  French       95.0   94.5   67.4
hu  Hungarian    85.8   89.0   76.2
id  Indonesian   89.7   92.8   ----
it  Italian      96.2   93.8   87.1
nl  Dutch        69.5   93.2   65.0
no  Norwegian    96.0   74.9   68.6
pl  Polish       98.0   97.8   88.8
pt  Portuguese   88.0   88.6   47.4
ro  Romanian     92.8   96.1   82.6
sv  Swedish      96.0   96.4   75.6
tr  Turkish      97.6   97.4   ----
vi  Vietnamese   98.7   98.9   ----
total            92.2   93.8   70.0
(accuracy in %)

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on the Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html


Is twitter Language Detection Difficult? (1)

• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)


Is twitter Language Detection Difficult? (2)

• Tweets are too noisy
  – Spellings that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like "Cooooolll")
• The likelihood of a tweet tends to be low under a normal language model


OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)

The letter k isn't used in Italian


Motivation to Detect Short Text Language

• There are many small chunks of text besides twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for text that mixes multiple languages
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text


Our Goal

• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99%+ accurate detection for sentences with more than 3 words


We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels


Proposed Method


How to Increase Features beyond 3-grams?

• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of features is O(T^2)

  n   # of n-grams
      freq ≥ 1   freq ≥ 2   freq ≥ 10
  1         79         72          57
  2       1896       1533         902
  3      15970      10369        4525
  4      64966      33941       10534
  5     167543      69719       15538
  6     323749     107861       18970
  7     524634     142954       21093
  8     760719     171995       22159
  9     921361     193995       22696

Cumulative distribution of feature length for 5090 normalized English tweets (300 KB)
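A table of this kind (number of distinct substrings of length at most n, filtered by frequency) can be produced with a brute-force count. The sketch below is an assumption about how such numbers could be computed, not the script used for the slide; `cumulative_ngram_counts` is a hypothetical name and the one-string corpus is a toy input.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, thresholds=(1, 2, 10)):
    """For each n, count distinct substrings of length <= n per frequency threshold."""
    counts = Counter()
    rows = []
    for n in range(1, max_n + 1):
        for text in texts:
            # add every n-gram of this length (nothing is added if the text is shorter)
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        rows.append((n, tuple(
            sum(1 for c in counts.values() if c >= th) for th in thresholds)))
    return rows


for n, row in cumulative_ngram_counts(["abab"], max_n=2):
    print(n, row)
```

The quadratic growth in the left-hand column is what makes a naive "all substrings" feature set impractical and motivates the maximal-substring trick on the following slides.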


Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Using maximal substrings gives an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction


Maximal Substring (1)

• Define a containment relation (a partial order) among non-empty substrings

  abracadabra

  – “ra” ⊂ “bra” ⇔ every “ra” occurs as a substring of “bra”
  – “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”

  (Strictly speaking, the definition also takes the position within the substring into account.)


Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are “a”, “abra” and “abracadabra”
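The "abracadabra" example can be checked by brute force. The sketch below treats a substring as maximal when extending it by one character on either side always lowers its occurrence count; this quadratic enumeration is only for illustration and is not the linear-time method from [Okanohara+ 09]. The function name `maximal_substrings` is hypothetical.

```python
from collections import Counter


def maximal_substrings(text):
    """Brute-force enumeration of maximal substrings (quadratic; for illustration only)."""
    # occurrence count of every non-empty substring (overlaps included)
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1))
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # if some one-character extension occurs just as often, sub is not maximal
        has_equal_extension = any(
            counts.get(ch + sub, 0) == freq or counts.get(sub + ch, 0) == freq
            for ch in alphabet)
        if not has_equal_extension:
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```

For instance, "bra" is dropped because "abra" occurs exactly as often (twice), while "a" survives because every extension of it occurs fewer than five times.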


via http://d.hatena.ne.jp/nokuno/20120203/1328237067

Maximal Substring and Infinity-Gram

• Substrings related by containment always have equal frequencies
• In a model that is a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams


Although the equivalence does not hold exactly on the test set, we assume that a sufficiently large training set approximates it.

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix range with L > 0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/
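To make the three arrays concrete, the following sketch builds SA, L and B for "abracadabra" with a naive sort. It is far from linear time and is not how esaxx works; the function name `extended_suffix_array` and the `$` sentinel used for the suffix starting at position 0 are assumptions for the example.

```python
def extended_suffix_array(text, sentinel="$"):
    """Naive construction of suffix array, LCP array and BWT (illustration only)."""
    n = len(text)
    # suffix array: start positions of suffixes in lexicographic order
    sa = sorted(range(n), key=lambda i: text[i:])
    # lcp[rank]: length of the common prefix of suffixes rank-1 and rank
    lcp = [0] * n
    for rank in range(1, n):
        a, b = text[sa[rank - 1]:], text[sa[rank]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[rank] = k
    # BWT: the character preceding each suffix (sentinel for the whole string)
    bwt = "".join(text[i - 1] if i > 0 else sentinel for i in sa)
    return sa, lcp, bwt


sa, lcp, bwt = extended_suffix_array("abracadabra")
print(sa)
print(lcp)
print(bwt)
```

In the output, the suffixes "abra" and "abracadabra" are adjacent with a common prefix of length 4 and different preceding characters (`d` and `$`), which is why "abra" is maximal. The suffixes "bra" and "bracadabra" share a prefix of length 3 but are both preceded by `a`, so "bra" is not.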


via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character type (script) to detect
  – In short text detection, mixed text can be split by character type
• Latin alphabet languages
  – The most difficult character type to detect
  – More than 25 of these languages have over 5 million speakers


What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered with these
U+0100-017F   Latin Extended-A              (same as above)
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Hardly used by any present-day language
U+A720-A7FF   Latin Extended-D              (same as above)


Latin Alphabets in Unicode Codepoint Chart

[Chart legend: used for Vietnamese only / often used / sometimes used]


How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels


Language Label Annotation

• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from the ones labeled ‘fr’
  – Recover French tweets from the ones not labeled ‘fr’


(20% of all tweets have no timezone)

How to annotate


Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe on the creation of the Catalan corpus


language        training    test
ca  Catalan         9089    5082
cs  Czech           9082    7682
da  Danish          7388    5524
de  German         44448   10065
en  English        44520   10168
es  Spanish        44118   10265
fi  Finnish         8087    7050
fr  French         44339   10098
hu  Hungarian      10030    4904
id  Indonesian     44722   10181
it  Italian        43366   10152
nl  Dutch          44682   10007
no  Norwegian      10124    8496
pl  Polish         16771   10152
pt  Portuguese     44215   10208
ro  Romanian       10021    5911
sv  Swedish        44054   10032
tr  Turkish        44703   10308
vi  Vietnamese     15030   10488
total             538789  166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias


Bias by Data Size

• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Chart: share of tweets per language. Labels: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
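The leveling step can be sketched as follows, using the approximation named on the slide (whole copies plus a remainder sampled without replacement). The function name `level_by_oversampling`, the dictionary layout of the corpus and the fixed seed are assumptions for the example, not ldig's code.

```python
import random
from collections import Counter


def level_by_oversampling(corpus, seed=0):
    """Grow every language to the size of the largest one.

    corpus maps a language code to a list of texts. Each list is copied a
    whole number of times and the remainder is sampled without replacement.
    """
    rng = random.Random(seed)
    target = max(len(items) for items in corpus.values())
    leveled = {}
    for lang, items in corpus.items():
        full_copies, remainder = divmod(target, len(items))
        leveled[lang] = items * full_copies + rng.sample(items, remainder)
    return leveled


toy = {"en": [f"en{i}" for i in range(10)], "ca": ["ca0", "ca1", "ca2"]}
leveled = level_by_oversampling(toy)
print({lang: len(items) for lang, items in leveled.items()})
print(Counter(leveled["ca"]))
```

Compared with plain sampling with replacement, this keeps every original text in the leveled set and bounds how unevenly individual texts are repeated.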

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding ‘I’

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
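Lowercasing everything except ‘I’ takes only a few lines. This is a sketch of the rule stated on the slide, not ldig's own code, and the function name `lower_except_i` is hypothetical. Note that Python's `str.lower()` turns İ (U+0130) into `i` followed by a combining dot (U+0307); the sketch leaves that behavior untouched.

```python
def lower_except_i(text, keep=("I",)):
    """Lowercase every character except those listed in keep."""
    return "".join(ch if ch in keep else ch.lower() for ch in text)


print(lower_except_i("ISTANBUL Is Big"))
```

Leaving ‘I’ untouched means that neither the Turkish reading (ı) nor the reading of other languages (i) is forced onto the text before the language is known.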

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular in news, on twitter and on Wikipedia
• The 2 code points have the same design in some fonts
  – Indistinguishable

  ș (U+0219)   ş (U+015F)
  ț (U+021B)   ţ (U+0163)


Romanian Character Affairs on PC

• Although Romanian orthography stipulates that ‘s/t with comma’ must be used, these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)


‘s/t with cedilla’ is used on an advertisement board in Bucharest

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’

  ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the glyph variants of the Japanese character ‘SA’ (さ)

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A
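Both substitute-character rules (Romanian comma-below to cedilla, Farsi yeh to Arabic yeh) are plain one-to-one code point mappings, so `str.translate` is enough to sketch them. The table name `SUBSTITUTES` and the function name are hypothetical; the code points are the ones listed on the slides.

```python
# Map each "correct" code point to the more popular substitute,
# following the direction chosen on the slides.
SUBSTITUTES = str.maketrans({
    0x0218: 0x015E,  # Ș -> Ş
    0x0219: 0x015F,  # ș -> ş
    0x021A: 0x0162,  # Ț -> Ţ
    0x021B: 0x0163,  # ț -> ţ
    0x06CC: 0x064A,  # Farsi yeh -> Arabic yeh
})


def normalize_substitutes(text):
    """Fold the two spellings of each character into a single code point."""
    return text.translate(SUBSTITUTES)


print(normalize_substitutes("cuno\u0219tin\u021b\u0103"))
```

The direction of the mapping does not matter for detection as long as training and test text are normalized the same way.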


Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72


Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Both forms appear about half and half in news and tweets
• Normalize form 2 into form 1
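One way to fold the combining form into the precomposed form is Unicode NFC normalization from Python's standard library. Using NFC here is an assumption made for the sketch; the slides do not say how the normalization is implemented in ldig.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letter + combining tone mark into a single code point (NFC)."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # ă followed by COMBINING TILDE
composed = compose_vietnamese(combined)
print(len(combined), len(composed), hex(ord(composed)))
```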

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which contains “的” tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese


CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first standard (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)


Normalization for twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD or :p
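The removal rules can be sketched with a few regular expressions. The patterns below are assumptions written for this example and are not copied from ldig; emoticon removal is left out because the slide does not specify which emoticons are covered.

```python
import re

# Illustrative patterns (assumptions, not ldig's actual expressions)
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags and RT markers, then collapse spaces."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_noise("RT @user check this http://t.co/abc #fun"))
```

URLs are removed first so that a `#fragment` or `@` inside a URL is not mistaken for a hashtag or mention.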


Normalization for twitter-Specific Representation

• How to handle words like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • Japanese, in contrast, has “かたたたき”, “かわいいいぬ”, “あわてててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
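Case 2 is a one-line regular expression with a backreference. This is a minimal sketch of the rule as stated on the slide; the function name `shrink_repeats` is hypothetical.

```python
import re

# a character followed by two or more copies of itself
REPEAT_RE = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEAT_RE.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Unlike the dictionary approach of Case 1, the result is not necessarily a real word ("cooll" rather than "cool"), but it maps all lengthened variants onto one consistent form, which is enough for substring features.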


Laugh Normalization

• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
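A tolerant pattern is needed because typed laughter is often irregular (the slide's own example "hahahha" has a doubled h). The regular expression below is an assumption written for this sketch, restricted to the consonants h, j and k seen in the examples; it is not taken from ldig.

```python
import re

# consonant(s) + vowel(s), followed by one or more sloppy repetitions of the same pair
LAUGH_RE = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+", re.IGNORECASE)


def normalize_laugh(text):
    """Shrink repeated laughter such as 'hahahha' or 'jejejeje' to two syllables."""
    return LAUGH_RE.sub(r"\1\2\1\2", text)


print(normalize_laugh("hahahha"))
print(normalize_laugh("Malo jejejeje xp"))
```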


Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array


Usage (1): Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value


Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with “ ” and line feeds with U+0001
  – Output is ”[feature]\t[frequency]”


Usage (2): Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of the L1 regularization
  – --wr: how often to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step


Usage (3): Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model


Usage (4): Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary


Data Format

• Training and test data
  – [correct label]\t[metadata]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL


ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
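A line in this format can be split as below. The sketch assumes that fields are separated by TAB characters, that the label is the first field and the text the last, with any metadata fields in between; the sample values are made up for the illustration and `parse_line` is a hypothetical helper, not part of ldig.

```python
def parse_line(line):
    """Split '[label]\\t[metadata...]\\t[text]' into (label, metadata list, text)."""
    fields = line.rstrip("\n").split("\t")
    label, metadata, text = fields[0], fields[1:-1], fields[-1]
    return label, metadata, text


# made-up sample lines: one with empty metadata, one with four metadata fields
print(parse_line("en\t\tu should just enjoy ur vacation\n"))
print(parse_line("ca\t123\t2012-01-01\t42\tes\txDDD no m'extranya\n"))
```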


Usage (5): Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it is started
  – For a text entered in the text area, it outputs the language probabilities, the features the text contains and their parameters


Estimation


LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size".

language          size  detect  correct  precision  recall   LD53  LDsm
ca  Catalan       5093    4923     4857     98.66%  95.37%   95.3  97.0
cs  Czech         7681    7668     7663     99.93%  99.77%   96.3  99.7
da  Danish        5516    5472     5310     97.04%  96.27%   94.5  92.4
de  German       10060   10069    10006     99.37%  99.46%   86.6  93.8
en  English      10162   10133    10029     98.97%  98.69%   88.3  95.0
es  Spanish      10244   10284    10120     98.41%  98.79%   91.5  96.0
fi  Finnish       7051    7038     7024     99.80%  99.62%   98.9  99.6
fr  French       10074   10134    10051     99.18%  99.77%   95.0  98.1
hu  Hungarian     4904    4892     4858     99.30%  99.06%   85.8  95.5
id  Indonesian   10178   10225    10160     99.36%  99.82%   89.7  98.9
it  Italian      10143   10205    10103     99.00%  99.61%   96.2  98.0
nl  Dutch        10005    9916     9858     99.42%  98.53%   69.5  97.4
no  Norwegian     8504    8432     8201     97.26%  96.44%   96.0  96.3
pl  Polish       10151   10149    10130     99.81%  99.79%   98.0  99.7
pt  Portuguese   10212   10201    10119     99.20%  99.09%   88.0  96.9
ro  Romanian      5913    5867     5850     99.71%  98.93%   92.8  97.4
sv  Swedish      10025   10093     9942     98.50%  99.17%   96.0  97.9
tr  Turkish      10308   10317    10298     99.82%  99.90%   97.6  99.5
vi  Vietnamese   10487   10480    10474     99.94%  99.88%   98.7  99.2
total           166711           165053             99.01%   92.2  97.4
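The precision and recall columns follow directly from the three count columns: precision is correct / detect and recall is correct / size. The short check below recomputes the Catalan row; the function name is hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to two decimals.

    size:    number of test texts whose true label is this language
    detect:  number of texts the detector labeled as this language
    correct: number of those detections that were right
    """
    precision = round(100.0 * correct / detect, 2)
    recall = round(100.0 * correct / size, 2)
    return precision, recall


print(precision_recall(size=5093, detect=4923, correct=4857))  # Catalan row
```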

Estimation for LIGA dataset

• Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/


Uses the 19-language model

language       size  detect  correct  precision  recall
de  German     1479    1476     1469      99.5%   99.3%
en  English    1505    1502     1490      99.2%   99.0%
es  Spanish    1562    1548     1541      99.6%   98.7%
fr  French     1551    1549     1540      99.4%   99.3%
it  Italian    1539    1531     1528      99.8%   99.3%
nl  Dutch      1430    1429     1424      99.7%   99.6%
total          9066             8992              99.2%

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports


                         ldig            langdetect      CLD
language         size    correct  rate   correct  rate   correct  rate
bg  Bulgarian    1000       ----  ----       988  98.8       991  99.1
cs  Czech        1000       1000 100.0       994  99.4       995  99.5
da  Danish       1000        976  97.6       968  96.8       932  93.2
de  German       1000        999  99.9       998  99.8      1000 100.0
el  Greek        1000       ----  ----      1000 100.0      1000 100.0
en  English      1000        999  99.9       996  99.6      1000 100.0
es  Spanish      1000       1000 100.0       996  99.6       989  98.9
et  Estonian     1000       ----  ----       996  99.6       998  99.8
fi  Finnish      1000        997  99.7       998  99.8      1000 100.0
fr  French       1000        999  99.9       999  99.9       992  99.2
hu  Hungarian    1000       1000 100.0       999  99.9       999  99.9
it  Italian      1000        999  99.9       999  99.9       996  99.6
lt  Lithuanian   1000       ----  ----       997  99.7       999  99.9
lv  Latvian      1000       ----  ----       999  99.9       998  99.8
nl  Dutch        1000       1000 100.0       974  97.4       995  99.5
pl  Polish       1000        998  99.8       999  99.9       997  99.7
pt  Portuguese   1000        995  99.5       996  99.6       989  98.9
ro  Romanian     1000       1000 100.0       999  99.9       998  99.8
sk  Slovak       1000       ----  ----       988  98.8       990  99.0
sl  Slovene      1000       ----  ----       976  97.6       963  96.3
sv  Swedish      1000        995  99.5       991  99.1       993  99.3
total           21000      13957  99.7     20850  99.3     20814  99.1
(rate in %)

Conclusions

• Language detector using the maximal substring model
  – Over 99% accuracy for 19 languages
  – Even langdetect trained on the tweet corpus reaches 97% accuracy
• If the corpus is maintained further, the precision will improve further
  – There are still many mistakes (in particular between da and no)
• If metadata is added to the features, the precision will improve further
  – How can metadata be added and trained at low cost?
• We would like to shrink the model without loss of precision
  – It is too large for applications (> 100 MB)


References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 68

Page 3: Short Text Language Detection with Infinity-Gram

Language Detection

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 5

In What Language

bull Ik kan er nooit tegen als mensen me negeren

bull Aha ich seh angeblich suumlszlig aus

bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli

bull Ah Tak Saring skal jeg bare finde ud af hvordan

bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk )

bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen

bull Ccedilok doğru En buumlyuumlk hatayı yaptım

bull Icircncacircntat de cunoștință

bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 6

Hints

bull Dutch if there is ik

bull German if there is ich or a letter szlig

bull Polish if there is czy or letters Ł ń ś or ź

bull Scandinavian if there is a letter aring

ndash Danish if there is af Tak means thanks

ndash Norwegian if there is nei Takk means thanks

ndash Swedish if there is och Tack means thanks

bull Turkish if there is a letter ı ( i without point) or ğ

bull Romanian if there is a letter ă or ș or ț

ndash Although ă is also used in Vietnamese it is easy to distinguish them

ndash Although ş is also used in Turkish it is easy to distinguish them

bull Vietnamese if there are many unreadable letters on WinXP P

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 7

In What Language? (Solution)

• Ik kan er nooit tegen als mensen me negeren → Dutch
• Aha ich seh angeblich süß aus → German
• Czy mógłbym zasnąć w przedmieściach Twoich myśli → Polish
• Ah Tak Så skal jeg bare finde ud af hvordan → Danish
• Det er ikke så digg nei å vi som har finale til helga Skrekk og gru Takk :) → Norwegian
• tack kompis Hade faktiskt tänkt maila dig på fb och fråga vart du tog vägen → Swedish
• Çok doğru En büyük hatayı yaptım → Turkish
• Încântat de cunoștință → Romanian
• Một người dân bị thương và bốn người mất tích sau khi một ngọn núi lửa ở miền trung → Vietnamese

What's Language Detection?

• Detecting which language the input text is written in
  – "Time flies like an arrow" → English
  – "Buona sera" → Italian
• It is a prerequisite for many language processing tasks
  – A language model is built for each language
  – Text search, classification, extraction, translation
• Sufficiently long, noise-free text can be detected with more than 99% accuracy [Cavnar+ 94]
  – A 3-gram model is used in many methods

SPAM or not?

• To decide, it is first necessary to know that the text is written in Polish

Document Categorization with Naive Bayes Classifier

• Categorize a document $X = (X_i)$ into a category $C_k$
  – A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Words are assumed to be conditionally independent given the category
  – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
  – where $p(X_i \mid C_k)$ is the relative frequency of the word in the category
• Estimate the category $C_k$ that maximizes the posterior
  – $p(C_k \mid X) = \dfrac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
  – where $p(C_k)$ is the prior of the category
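The posterior above can be sketched in a few lines of Python. This is a toy illustration, not the langdetect implementation: the function names are hypothetical, and add-one (Laplace) smoothing with parameter `alpha` is an assumption added here so that unseen features do not yield a zero probability.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (label, features). Returns class counts, feature counts, vocabulary."""
    priors = Counter(label for label, _ in docs)
    counts = {}
    for label, feats in docs:
        counts.setdefault(label, Counter()).update(feats)
    vocab = {f for c in counts.values() for f in c}
    return priors, counts, vocab

def log_posterior(feats, label, priors, counts, vocab, alpha=1.0):
    """log p(C_k) + sum_i log p(X_i | C_k), up to the constant log p(X)."""
    total = sum(counts[label].values())
    lp = math.log(priors[label] / sum(priors.values()))
    for f in feats:
        # alpha is the smoothing assumption, not part of the slide's formula
        lp += math.log((counts[label][f] + alpha) / (total + alpha * len(vocab)))
    return lp

def classify(feats, priors, counts, vocab):
    """Return the category with the largest (log) posterior."""
    return max(priors, key=lambda c: log_posterior(feats, c, priors, counts, vocab))
```

Working in log space avoids underflow when a document has many features, and the argmax is unchanged because the logarithm is monotonic.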

Language Detection with Naive Bayes Classifier

• Document categorization with language labels
  – Categorize documents into English, Japanese and so on
• Use character n-grams as features
  – Unicode code point n-grams, strictly speaking
  – Assume the character encoding of the document is already known
    • Most applications know the encoding of their internal text data

Why Use n-Gram to Detect Language?

• Each language has its own characteristic characters and spelling rules
  – "é" is often used in Spanish, Italian and so on, but in principle not in English
  – Many words start with "Z" in German, but not in English
  – Many words start with "C" in English, but not in German
  – The spelling "Th" is often used in English, but not in other languages

n-grams of the word "This" (the word boundary is written here as _):

1-gram: T, h, i, s
2-gram: _T, Th, hi, is, s_
3-gram: _Th, Thi, his, is_

          C     L     Z     Th
English   0.75  0.47  0.02  0.74
German    0.10  0.37  0.53  0.03
French    0.38  0.69  0.01  0.01
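The decomposition of "This" into 1-, 2- and 3-grams can be reproduced with a short helper. It is a sketch only: the function name and the `_` padding character are assumptions for illustration, and langdetect's real feature extraction applies further normalization.

```python
def char_ngrams(word, n, pad="_"):
    """Return the character n-grams of a word.

    For n >= 2 the word is padded on both sides with a boundary marker,
    so that word-initial and word-final spellings become features too.
    """
    if n == 1:
        return list(word)
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For example, `char_ngrams("This", 2)` yields the five 2-grams `_T`, `Th`, `hi`, `is`, `s_`, which is how a feature such as word-initial "Th" becomes visible to the classifier.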

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr

Estimation with News Text

• Test on news text crawled from the web in 49 languages

code   language             size  correct (accuracy)
af     Afrikaans            200   199 (99.50%)
ar     Arabic               200   200 (100.00%)
bg     Bulgarian            200   200 (100.00%)
bn     Bengali              200   200 (100.00%)
cs     Czech                200   200 (100.00%)
da     Danish               200   179 (89.50%)
de     German               200   200 (100.00%)
el     Greek                200   200 (100.00%)
en     English              200   200 (100.00%)
es     Spanish              200   200 (100.00%)
fa     Persian              200   200 (100.00%)
fi     Finnish              200   200 (100.00%)
fr     French               200   200 (100.00%)
gu     Gujarati             200   200 (100.00%)
he     Hebrew               200   200 (100.00%)
hi     Hindi                200   200 (100.00%)
hr     Croatian             200   200 (100.00%)
hu     Hungarian            200   200 (100.00%)
id     Indonesian           200   200 (100.00%)
it     Italian              200   200 (100.00%)
ja     Japanese             200   200 (100.00%)
kn     Kannada              200   200 (100.00%)
ko     Korean               200   200 (100.00%)
mk     Macedonian           200   200 (100.00%)
ml     Malayalam            200   200 (100.00%)
mr     Marathi              200   200 (100.00%)
ne     Nepali               200   200 (100.00%)
nl     Dutch                200   200 (100.00%)
no     Norwegian            200   199 (99.50%)
pa     Punjabi              200   200 (100.00%)
pl     Polish               200   200 (100.00%)
pt     Portuguese           200   200 (100.00%)
ro     Romanian             200   200 (100.00%)
ru     Russian              200   200 (100.00%)
sk     Slovak               200   200 (100.00%)
so     Somali               200   200 (100.00%)
sq     Albanian             200   200 (100.00%)
sv     Swedish              200   200 (100.00%)
sw     Swahili              200   200 (100.00%)
ta     Tamil                200   200 (100.00%)
te     Telugu               200   200 (100.00%)
th     Thai                 200   200 (100.00%)
tl     Tagalog              200   200 (100.00%)
tr     Turkish              200   200 (100.00%)
uk     Ukrainian            200   200 (100.00%)
ur     Urdu                 200   200 (100.00%)
vi     Vietnamese           200   200 (100.00%)
zh-cn  Simplified Chinese   200   200 (100.00%)
zh-tw  Traditional Chinese  200   200 (100.00%)
total                       9800  9777 (99.77%)

Estimation with Europarl datasets

• Test on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
    • http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

code  language    size   correct  accuracy
bg    Bulgarian   1000   988      98.8%
cs    Czech       1000   994      99.4%
da    Danish      1000   968      96.8%
de    German      1000   998      99.8%
el    Greek       1000   1000     100.0%
en    English     1000   996      99.6%
es    Spanish     1000   996      99.6%
et    Estonian    1000   996      99.6%
fi    Finnish     1000   998      99.8%
fr    French      1000   999      99.9%
hu    Hungarian   1000   999      99.9%
it    Italian     1000   999      99.9%
lt    Lithuanian  1000   997      99.7%
lv    Latvian     1000   999      99.9%
nl    Dutch       1000   974      97.4%
pl    Polish      1000   999      99.9%
pt    Portuguese  1000   996      99.6%
ro    Romanian    1000   999      99.9%
sk    Slovak      1000   988      98.8%
sl    Slovene     1000   976      97.6%
sv    Swedish     1000   991      99.1%
total             21000  20850    99.3%

Language detection is solved, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):

code  language    LD    CLD   Tika
ca    Catalan     95.3  93.0  83.8
cs    Czech       96.3  96.6  ----
da    Danish      94.5  90.7  58.7
de    German      86.6  96.8  73.1
en    English     88.3  97.4  54.7
es    Spanish     91.5  90.5  44.4
fi    Finnish     98.9  99.4  94.8
fr    French      95.0  94.5  67.4
hu    Hungarian   85.8  89.0  76.2
id    Indonesian  89.7  92.8  ----
it    Italian     96.2  93.8  87.1
nl    Dutch       69.5  93.2  65.0
no    Norwegian   96.0  74.9  68.6
pl    Polish      98.0  97.8  88.8
pt    Portuguese  88.0  88.6  47.4
ro    Romanian    92.8  96.1  82.6
sv    Swedish     96.0  96.4  75.6
tr    Turkish     97.6  97.4  ----
vi    Vietnamese  98.7  98.9  ----
total             92.2  93.8  70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on the Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection difficult? (1)

• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult? (2)

• Tweets are too noisy
  – Spellings that violate the language's orthography appear often
  – Acronyms, abbreviations, lengthened words (like "Cooooolll")
• The likelihood of a tweet tends to be low under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)

(The letter k isn't used in Italian.)

Motivation to Detect Short Text Language

• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue tracker of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed-language text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal

• Over 99% accuracy
  – However, it is too difficult to detect a one-word sentence
  – Our goal is 99%+ accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a twitter-specific language model, or a corpus to build it from
  – a corpus of about 700K tweets with language labels

Proposal Method


How to increase features from 3-grams?

• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But that has O(T^2) order

Cumulative number of distinct n-grams by feature length, for 5090 normalized English tweets (300KB):

n of n-gram  freq≥1   freq≥2   freq≥10
1            79       72       57
2            1896     1533     902
3            15970    10369    4525
4            64966    33941    10534
5            167543   69719    15538
6            323749   107861   18970
7            524634   142954   21093
8            760719   171995   22159
9            921361   193995   22696
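A table of this kind can be computed with a few lines of Python. The sketch below is an assumption about how such counts are obtained (n-grams are counted within each text, without boundary padding); the function name is hypothetical.

```python
from collections import Counter

def cumulative_ngram_counts(texts, max_n, min_freqs=(1, 2, 10)):
    """For n = 1..max_n, count distinct n-grams of length <= n
    whose total frequency reaches each threshold in min_freqs."""
    freq = Counter()
    rows = []
    for n in range(1, max_n + 1):
        for t in texts:
            freq.update(t[i:i + n] for i in range(len(t) - n + 1))
        rows.append((n, [sum(1 for c in freq.values() if c >= m) for m in min_freqs]))
    return rows
```

The number of distinct features keeps growing with n, which is the cost the maximal substring model is meant to avoid.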

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Using maximal substrings gives an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (a partial order) among non-empty substrings
  Example text: abracadabra
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  Strictly speaking, the relation is defined together with the position inside the containing substring.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
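The "abracadabra" example can be checked with a brute-force sketch. It relies on an equivalent characterization, stated here as an assumption of the sketch: a substring is maximal when extending it by one character on either side always lowers its number of occurrences. This is far slower than the linear-time suffix array approach used in practice and is meant only to make the definition concrete.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, including overlapping ones."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count

def maximal_substrings(text):
    """Return all maximal substrings of text, shortest first (brute force)."""
    alphabet = set(text)
    subs = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    result = []
    for s in subs:
        n = count_occurrences(text, s)
        # Maximal: every one-character extension occurs strictly less often.
        if all(count_occurrences(text, c + s) < n and
               count_occurrences(text, s + c) < n for c in alphabet):
            result.append(s)
    return sorted(result, key=len)
```

For instance, "bra" is not maximal because "abra" occurs exactly as often (twice), so "bra" carries no information that "abra" does not.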

Maximal Substring and Infinity-Gram

• Substrings in a containment relationship always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be factored into a single one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams

Although the equivalence breaks down on the test set, we assume it is well approximated given a sufficiently large training set.

Extended Suffix Array

• An Extended Suffix Array (ESA) consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than one character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character type to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• Latin letters are assigned to 9 code blocks in Unicode

Range         Name                         Supplement
U+0000-007F   Basic Latin                  ASCII
U+0080-00FF   Latin-1 Supplement           Most languages are covered by these two blocks
U+0100-017F   Latin Extended-A             (same as above)
U+0180-024F   Latin Extended-B             Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             Hardly used by any present-day language
U+A720-A7FF   Latin Extended-D             (same as above)
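The block table can be turned into a simple lookup, for example to check which blocks a corpus actually uses. The ranges are copied from the table above; the function name is hypothetical.

```python
# (first code point, last code point, block name), as listed in the table
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]

def latin_block(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for first, last, name in LATIN_BLOCKS:
        if first <= cp <= last:
            return name
    return None
```

Note that the first two blocks also contain digits, punctuation and control characters, so membership in a block does not by itself mean the character is a letter.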

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code point chart of Latin letters. Legend: used often / used sometimes / used only for Vietnamese]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  – Sampling 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  (20% of all tweets have no timezone)
• Annotate the tweets with tentative labels using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'

How to annotate

[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]

Created Corpus

• Noise-free tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan corpus

code  language    training  test
ca    Catalan     9089      5082
cs    Czech       9082      7682
da    Danish      7388      5524
de    German      44448     10065
en    English     44520     10168
es    Spanish     44118     10265
fi    Finnish     8087      7050
fr    French      44339     10098
hu    Hungarian   10030     4904
id    Indonesian  44722     10181
it    Italian     43366     10152
nl    Dutch       44682     10007
no    Norwegian   10124     8496
pl    Polish      16771     10152
pt    Portuguese  44215     10208
ro    Romanian    10021     5911
sv    Swedish     44054     10032
tr    Turkish     44703     10308
vi    Vietnamese  15030     10488
total             538789    166773

Simple Language Detection

• A language detector can be built from the maximal substring model and the twitter corpus
  – But it still reaches at most 98% accuracy
• We guess that it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size

• The number of tweets per language is heavily biased
• Level them out by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Figure: share of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
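The approximation described above (whole copies plus a remainder sampled without replacement) can be sketched as follows. The function name, the dictionary layout and the fixed seed are assumptions for illustration; every language is assumed to have at least one tweet.

```python
import random

def level_out(corpus, seed=0):
    """corpus: dict mapping language code -> list of tweets.

    Returns a dict in which every language has as many tweets as the largest one.
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        # whole copies of the data, then the remainder without replacement
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled
```

Compared with plain sampling with replacement, this guarantees that every original tweet is used at least `copies` times.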

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the corpus size needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

Language              Upper case   Lower case
Turkish, Azerbaijani  I (U+0049)   ı (U+0131)
                      İ (U+0130)   i (U+0069)
Others                I (U+0049)   i (U+0069)
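A minimal sketch of "lower case excluding 'I'" follows. The function name is hypothetical, and how ldig treats İ (U+0130) is not stated on the slide, so this sketch simply leaves that to Python's default behaviour.

```python
def lower_except_i(text):
    """Lowercase every character except the ASCII capital 'I' (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

One detail worth knowing: Python's `str.lower()` maps İ (U+0130) to "i" followed by the combining dot U+0307, so a real pipeline would probably want an explicit rule for that character as well.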

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s and t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more common in news, on twitter and on Wikipedia
• The 2 kinds have the same glyph design in some fonts
  – Indistinguishable

ș (U+0219)   ş (U+015F)
ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography stipulates that 's/t with comma' must be used, these letters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more common than the proper ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon this is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ).
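The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a four-entry character mapping. The sketch below uses the code points listed on the slides; the function and table names are hypothetical.

```python
# comma-below letters (U+0218..U+021B) -> cedilla letters (U+015E, U+015F, U+0162, U+0163)
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})

def normalize_romanian(text):
    """Map s/t with comma below to s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)
```

Mapping toward the more frequent form means less of the corpus has to change, at the price of making Romanian ş identical to Turkish ş at the character level.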

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is very frequent in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in general documents such as news
• The tone marks can be attached to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with diacritical marks
    • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1
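For the example on the slide, "normalize 2 into 1" coincides with Unicode canonical composition (NFC). The sketch below uses Python's standard `unicodedata` module as a stand-in; whether ldig uses NFC or its own mapping table is not stated here, so treat this as an assumption.

```python
import unicodedata

def compose_vietnamese(text):
    """Compose base letters and combining marks into precomposed letters (NFC)."""
    return unicodedata.normalize("NFC", text)
```

After composition, ă (U+0103) followed by the combining tilde (U+0303) becomes the single code point ẵ (U+1EB5), so both spellings produce the same character n-grams.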

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) "Commonly used Kanji" lists provided for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – face marks written with letters, like XD or :p
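The removal step can be sketched with regular expressions. The patterns below are illustrative assumptions, not the ones used by ldig; in particular the face-mark pattern covers only a few common shapes.

```python
import re

# Illustrative patterns (assumptions), applied in this order.
PATTERNS = [
    re.compile(r"https?://\S+"),                             # URLs
    re.compile(r"@\w+"),                                     # mentions
    re.compile(r"#\w+"),                                     # hashtags
    re.compile(r"\bRT\b"),                                   # retweet marker
    re.compile(r"(?<!\w)(?:[:;=]-?[()DPp]+|[xX]D+)(?!\w)"),  # face marks such as :p or XD
]

def strip_twitter_noise(text):
    """Remove twitter-specific tokens and collapse the remaining whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

Replacing matches with a space rather than an empty string keeps the neighbouring words from being glued together.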

Normalization for twitter-Specific Representation

• How should words like 'coooooooollllll' be handled?
• Option 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Option 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • Japanese does, e.g. "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection anyway
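Option 2 fits in a single regular expression. This sketch reads the rule as "3 or more repetitions become 2" and applies it to any character, which is a simplification; the function name is hypothetical.

```python
import re

REPEATED = re.compile(r"(.)\1{2,}")  # one character followed by 2+ copies of itself

def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters down to 2."""
    return REPEATED.sub(r"\1\1", text)
```

The result is not always a dictionary word ("coooollll" becomes "cooll"), but lengthened variants of the same word collapse to one form, which is what matters for feature counts.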

Laugh Normalization

• Laughter is written differently in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
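A rough sketch of laugh shrinking follows, assuming regular laughs made of one repeated two-letter syllable. It is not the rule used by ldig, and it does not cover irregular spellings such as the "hahahha" example above; the function name is hypothetical.

```python
import re

LAUGH = re.compile(r"(\w\w)\1{2,}")  # a two-letter unit repeated 3 or more times

def shrink_laugh(text):
    """Lowercase the text and shrink 3+ repetitions of a two-letter unit to 2."""
    return LAUGH.sub(r"\1\1", text.lower())
```

Requiring at least three repetitions keeps ordinary words with a doubled syllable, such as "banana", untouched.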

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output is "[feature]\t[frequency]"
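The input and output conventions above can be illustrated as follows. This is one reading of the slide, not code from ldig or maxsubst: the helper names are hypothetical, and it is an assumption that every line feed becomes exactly one U+0001.

```python
def to_extractor_input(lines):
    """Join lines into one string: TAB -> space, each line feed -> U+0001."""
    return "".join(line.rstrip("\n").replace("\t", " ") + "\u0001" for line in lines)

def parse_extractor_output(lines):
    """Parse '[feature]<TAB>[frequency]' lines into (feature, frequency) pairs."""
    pairs = []
    for line in lines:
        feature, freq = line.rstrip("\n").rsplit("\t", 1)
        pairs.append((feature, int(freq)))
    return pairs
```

Replacing TABs first keeps the tab character free to serve as the field separator in the output, and U+0001 marks tweet boundaries with a character that never occurs in normal text.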

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data (fields are TAB-separated)
  – [correct label]\t[meta data]\t[text]

Examples:

en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca    [status ID]    [datetime]    [userID]    [language of UI]    xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
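A line in this format can be parsed as sketched below. Treating the first field as the label, the last field as the text and everything in between as metadata is an assumption drawn from the Catalan example; the function name and the sample metadata values in the usage note are hypothetical.

```python
def parse_line(line):
    """Split '[label]<TAB>[meta data...]<TAB>[text]' into (label, meta, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    meta = fields[1:-1]  # zero or more metadata fields
    return label, meta, text
```

For example, `parse_line("ca\t123\t2012-05-14\t42\tca\txDDD")` gives the label `"ca"`, four metadata fields and the text `"xDDD"`.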

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it has started
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the twitter corpus

A text whose maximum probability is below 0.6 is treated as undetectable, so the sum of "detect" is less than the sum of "size".

code  language    size    detect  correct  precision  recall  LD53  LDsm
ca    Catalan     5093    4923    4857     98.66      95.37   95.3  97.0
cs    Czech       7681    7668    7663     99.93      99.77   96.3  99.7
da    Danish      5516    5472    5310     97.04      96.27   94.5  92.4
de    German      10060   10069   10006    99.37      99.46   86.6  93.8
en    English     10162   10133   10029    98.97      98.69   88.3  95.0
es    Spanish     10244   10284   10120    98.41      98.79   91.5  96.0
fi    Finnish     7051    7038    7024     99.80      99.62   98.9  99.6
fr    French      10074   10134   10051    99.18      99.77   95.0  98.1
hu    Hungarian   4904    4892    4858     99.30      99.06   85.8  95.5
id    Indonesian  10178   10225   10160    99.36      99.82   89.7  98.9
it    Italian     10143   10205   10103    99.00      99.61   96.2  98.0
nl    Dutch       10005   9916    9858     99.42      98.53   69.5  97.4
no    Norwegian   8504    8432    8201     97.26      96.44   96.0  96.3
pl    Polish      10151   10149   10130    99.81      99.79   98.0  99.7
pt    Portuguese  10212   10201   10119    99.20      99.09   88.0  96.9
ro    Romanian    5913    5867    5850     99.71      98.93   92.8  97.4
sv    Swedish     10025   10093   9942     98.50      99.17   96.0  97.9
tr    Turkish     10308   10317   10298    99.82      99.90   97.6  99.5
vi    Vietnamese  10487   10480   10474    99.94      99.88   98.7  99.2
total             166711          165053   99.01 (accuracy)   92.2  97.4
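The precision and recall columns follow from the size, detect and correct counts: precision = correct / detect and recall = correct / size. The sketch below reproduces the Catalan row and the 0.6 rejection rule; the function names are hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision %, recall %), rounded to two decimals."""
    return round(100 * correct / detect, 2), round(100 * correct / size, 2)

def decide(probs, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    lang = max(probs, key=probs.get)
    return lang if probs[lang] >= threshold else None
```

For Catalan, `precision_recall(5093, 4923, 4857)` gives 98.66 and 95.37, matching the table. Because "detect" counts every text assigned to a language, including misclassified texts from other languages, it can exceed "size", as in the German row.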

Estimation for LIGA dataset

• Evaluate on the LIGA [Tromp+ 11] dataset of 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

code  language  size  detect  correct  precision  recall
de    German    1479  1476    1469     99.5       99.3
en    English   1505  1502    1490     99.2       99.0
es    Spanish   1562  1548    1541     99.6       98.7
fr    French    1551  1549    1540     99.4       99.3
it    Italian   1539  1531    1528     99.8       99.3
nl    Dutch     1430  1429    1424     99.7       99.6
total           9066          8992     99.2 (accuracy)

Estimation for Europarl Dataset

The ldig column covers only the languages that ldig supports.

code  language    size   ldig correct (rate)  langdetect correct (rate)  CLD correct (rate)
bg    Bulgarian   1000   -                    988 (98.8%)                991 (99.1%)
cs    Czech       1000   1000 (100.0%)        994 (99.4%)                995 (99.5%)
da    Danish      1000   976 (97.6%)          968 (96.8%)                932 (93.2%)
de    German      1000   999 (99.9%)          998 (99.8%)                1000 (100.0%)
el    Greek       1000   -                    1000 (100.0%)              1000 (100.0%)
en    English     1000   999 (99.9%)          996 (99.6%)                1000 (100.0%)
es    Spanish     1000   1000 (100.0%)        996 (99.6%)                989 (98.9%)
et    Estonian    1000   -                    996 (99.6%)                998 (99.8%)
fi    Finnish     1000   997 (99.7%)          998 (99.8%)                1000 (100.0%)
fr    French      1000   999 (99.9%)          999 (99.9%)                992 (99.2%)
hu    Hungarian   1000   1000 (100.0%)        999 (99.9%)                999 (99.9%)
it    Italian     1000   999 (99.9%)          999 (99.9%)                996 (99.6%)
lt    Lithuanian  1000   -                    997 (99.7%)                999 (99.9%)
lv    Latvian     1000   -                    999 (99.9%)                998 (99.8%)
nl    Dutch       1000   1000 (100.0%)        974 (97.4%)                995 (99.5%)
pl    Polish      1000   998 (99.8%)          999 (99.9%)                997 (99.7%)
pt    Portuguese  1000   995 (99.5%)          996 (99.6%)                989 (98.9%)
ro    Romanian    1000   1000 (100.0%)        999 (99.9%)                998 (99.8%)
sk    Slovak      1000   -                    988 (98.8%)                990 (99.0%)
sl    Slovene     1000   -                    976 (97.6%)                963 (96.3%)
sv    Swedish     1000   995 (99.5%)          991 (99.1%)                993 (99.3%)
total             21000  13957 (99.7%)        20850 (99.3%)              20814 (99.1%)

Conclusions

• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect trained on the tweet corpus reaches 97% accuracy
• If the corpus is maintained, the precision will rise further
  – There are still many mistakes (in particular for da and no)
• If metadata is added to the features, the precision will rise further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – It is too large for applications (>100MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 4: Short Text Language Detection with Infinity-Gram

In What Language

bull Ik kan er nooit tegen als mensen me negeren

bull Aha ich seh angeblich suumlszlig aus

bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli

bull Ah Tak Saring skal jeg bare finde ud af hvordan

bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk )

bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen

bull Ccedilok doğru En buumlyuumlk hatayı yaptım

bull Icircncacircntat de cunoștință

bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 6

Hints

bull Dutch if there is ik

bull German if there is ich or a letter szlig

bull Polish if there is czy or letters Ł ń ś or ź

bull Scandinavian if there is a letter aring

ndash Danish if there is af Tak means thanks

ndash Norwegian if there is nei Takk means thanks

ndash Swedish if there is och Tack means thanks

bull Turkish if there is a letter ı ( i without point) or ğ

bull Romanian if there is a letter ă or ș or ț

ndash Although ă is also used in Vietnamese it is easy to distinguish them

ndash Although ş is also used in Turkish it is easy to distinguish them

bull Vietnamese if there are many unreadable letters on WinXP P

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 7

In What Language (Solution)

bull Ik kan er nooit tegen als mensen me negeren Dutch

bull Aha ich seh angeblich suumlszlig aus German

bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli Polish

bull Ah Tak Saring skal jeg bare finde ud af hvordan Danish

bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk ) Norwegian

bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen Swedish

bull Ccedilok doğru En buumlyuumlk hatayı yaptım Turkish

bull Icircncacircntat de cunoștință Rumanian

bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung Vietnamese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 8

Whats Language Detection

bull To detect what language the input text written in

ndash Time fries like arrow rarr English

ndash Buona sera rarr Italian

bull It is prior for many language processing tasks

ndash Language model is built for each language

ndash Text search classification extraction translation

bull It is possible to detect for long enough and noiseless text with more than 99 accuracy [Cavnar+ 94]

ndash 3-gram model is used in many methods

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 9

SPAM or not

bull It is necessary to know that it is written in Polish

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 10

Document Categorization with Naive Bayes Classifier

• Categorize a document $X = (X_i)$ into a category $C_k$

– A document $X$ is represented as a collection of words $X_i$ (bag-of-words)

• Word probabilities are assumed to be conditionally independent given the category

– $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)

– where $p(X_i \mid C_k)$ is the relative frequency of word $X_i$ in category $C_k$

• Estimate the category $C_k$ that maximizes the posterior

– $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$

– where $p(C_k)$ is the prior for category $C_k$

Language Detection with Naive Bayes Classifier

• Document categorization with language labels

– Categorize documents into English, Japanese and so on

• Use character n-grams as features

– Unicode code point n-grams, strictly speaking

– Assume the character encoding of the document is already known

• Most applications know the encoding of their internal text data
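The classifier described above can be sketched in a few lines. This is a minimal illustration, assuming add-one smoothing, word-boundary padding with spaces, and a toy two-language training set; the names `char_ngrams` and `NaiveBayesLangDetector` are hypothetical and are not langdetect's API.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLangDetector:
    """Toy naive Bayes classifier over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of documents
        self.vocab = set()

    def fit(self, samples):
        for label, text in samples:
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.ngram_counts[label]
            total = sum(counts.values())
            # log p(C_k) + sum_i log p(X_i | C_k), with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


detector = NaiveBayesLangDetector()
detector.fit([
    ("en", "this is the thing"),
    ("en", "that is the path"),
    ("de", "ich sehe das nicht"),
    ("de", "das ist nicht gut"),
])
print(detector.predict("the thing is this"))
print(detector.predict("ich sehe nicht"))
```

Working in log space avoids underflow when many small probabilities are multiplied, which matters even for short texts once the feature count grows.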

Why Use n-Gram to Detect Language?

• Each language has its own characters and spelling rules

– "é" is often used in Spanish, Italian and so on, but not in English in principle

– There are many words which start with "Z" in German but not in English

– There are many words which start with "C" in English but not in German

– The spelling "Th" is often used in English but not in the other languages

Character n-grams of "This" ("_" marks a word boundary):

1-gram: T, h, i, s
2-gram: _T, Th, hi, is, s_
3-gram: _Th, Thi, his, is_

          C     L     Z     Th
English   0.75  0.47  0.02  0.74
German    0.10  0.37  0.53  0.03
French    0.38  0.69  0.01  0.01

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java

– http://code.google.com/p/language-detection/

– Apache License 2.0

– Character 3-gram + Bayesian filter

– Various normalizations + feature sampling

• Over 99% precision for 53 languages

– Trained on Wikipedia abstracts

– Wide language support, including Asian languages

– Adopted by Apache Solr

Estimation with News Text

• Test on news text crawled from the web in 49 languages

code   language             size  correct  accuracy
af     Afrikaans             200      199    99.50%
ar     Arabic                200      200   100.00%
bg     Bulgarian             200      200   100.00%
bn     Bengali               200      200   100.00%
cs     Czech                 200      200   100.00%
da     Danish                200      179    89.50%
de     German                200      200   100.00%
el     Greek                 200      200   100.00%
en     English               200      200   100.00%
es     Spanish               200      200   100.00%
fa     Persian               200      200   100.00%
fi     Finnish               200      200   100.00%
fr     French                200      200   100.00%
gu     Gujarati              200      200   100.00%
he     Hebrew                200      200   100.00%
hi     Hindi                 200      200   100.00%
hr     Croatian              200      200   100.00%
hu     Hungarian             200      200   100.00%
id     Indonesian            200      200   100.00%
it     Italian               200      200   100.00%
ja     Japanese              200      200   100.00%
kn     Kannada               200      200   100.00%
ko     Korean                200      200   100.00%
mk     Macedonian            200      200   100.00%
ml     Malayalam             200      200   100.00%
mr     Marathi               200      200   100.00%
ne     Nepali                200      200   100.00%
nl     Dutch                 200      200   100.00%
no     Norwegian             200      199    99.50%
pa     Punjabi               200      200   100.00%
pl     Polish                200      200   100.00%
pt     Portuguese            200      200   100.00%
ro     Romanian              200      200   100.00%
ru     Russian               200      200   100.00%
sk     Slovak                200      200   100.00%
so     Somali                200      200   100.00%
sq     Albanian              200      200   100.00%
sv     Swedish               200      200   100.00%
sw     Swahili               200      200   100.00%
ta     Tamil                 200      200   100.00%
te     Telugu                200      200   100.00%
th     Thai                  200      200   100.00%
tl     Tagalog               200      200   100.00%
tr     Turkish               200      200   100.00%
uk     Ukrainian             200      200   100.00%
ur     Urdu                  200      200   100.00%
vi     Vietnamese            200      200   100.00%
zh-cn  Simplified Chinese    200      200   100.00%
zh-tw  Traditional Chinese   200      200   100.00%
total                       9800     9777    99.77%

Estimation with Europarl datasets

• Test on 1000 samples for each language from the Europarl Parallel Corpus

– from the proceedings of the European Parliament

– http://www.statmt.org/europarl/

• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

code  language     size  correct  accuracy
bg    Bulgarian    1000      988     98.8%
cs    Czech        1000      994     99.4%
da    Danish       1000      968     96.8%
de    German       1000      998     99.8%
el    Greek        1000     1000    100.0%
en    English      1000      996     99.6%
es    Spanish      1000      996     99.6%
et    Estonian     1000      996     99.6%
fi    Finnish      1000      998     99.8%
fr    French       1000      999     99.9%
hu    Hungarian    1000      999     99.9%
it    Italian      1000      999     99.9%
lt    Lithuanian   1000      997     99.7%
lv    Latvian      1000      999     99.9%
nl    Dutch        1000      974     97.4%
pl    Polish       1000      999     99.9%
pt    Portuguese   1000      996     99.6%
ro    Romanian     1000      999     99.9%
sk    Slovak       1000      988     98.8%
sl    Slovene      1000      976     97.6%
sv    Swedish      1000      991     99.1%
total             21000    20850     99.3%

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on a tweet corpus

• LD = language-detection

• CLD = Chromium Compact Language Detection

– http://code.google.com/p/chromium-compact-language-detector/

– ms (Malay) is regarded as id (Indonesian)

• Tika = Apache Tika

– http://tika.apache.org/

– Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):

code  language      LD    CLD   Tika
ca    Catalan      95.3  93.0  83.8
cs    Czech        96.3  96.6  ----
da    Danish       94.5  90.7  58.7
de    German       86.6  96.8  73.1
en    English      88.3  97.4  54.7
es    Spanish      91.5  90.5  44.4
fi    Finnish      98.9  99.4  94.8
fr    French       95.0  94.5  67.4
hu    Hungarian    85.8  89.0  76.2
id    Indonesian   89.7  92.8  ----
it    Italian      96.2  93.8  87.1
nl    Dutch        69.5  93.2  65.0
no    Norwegian    96.0  74.9  68.6
pl    Polish       98.0  97.8  88.8
pt    Portuguese   88.0  88.6  47.4
ro    Romanian     92.8  96.1  82.6
sv    Swedish      96.0  96.4  75.6
tr    Turkish      97.6  97.4  ----
vi    Vietnamese   98.7  98.9  ----
total              92.2  93.8  70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium

– http://code.google.com/p/chromium-compact-language-detector/

– Implemented in C++, with a Python binding

– Number of supported languages: CLD = 76, langdetect = 53

– Accuracy: CLD = 98.82%, langdetect = 99.22%

• for 17 languages on the Europarl datasets

• http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection difficult? (1)

• A tweet is too short to extract enough 3-gram features

– At most 140 characters on twitter

– URLs, mentions and hashtags are not useful for detection

• LIGA [Tromp+ 11]

– Graph features based on 3-grams

• Adds long-distance features

• 95-98% accuracy for twitter language detection

• 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult? (2)

• Tweets are too noisy

– Spellings that violate the language's orthography often appear

– Acronyms, abbreviations, lengthened words (like "Cooooolll")

• The likelihood of a tweet tends to be lower under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'inquiète pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)

The letter k isn't used in Italian

Motivation to Detect Short Text Language

• There are many small chunks of text besides twitter

– Schedules, search queries, bulletin boards and so on

– There are many questions about short text detection on the issue board of the langdetect project

• http://code.google.com/p/language-detection/issues/detail?id=10

• Detection for text that mixes multiple languages

– Cut the target document into paragraphs or lines

– Run detection on each short text

Our Goal

• Over 99% accuracy

– However, it is too difficult to detect the language of a one-word sentence

– Our goal is 99%+ accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text

– Maximal substring model (∞-gram Logistic Regression)

• and a twitter-specific language model, or a corpus to construct it

– a corpus of about 700K tweets with language labels

Proposal Method


How to increase features from 3-grams

• The larger n is, the more features there are

• The maximum is at n = ∞, that is, all substrings

– But the number of substrings is of order $O(T^2)$

Number of n-grams (cumulative over lengths up to n):

n    freq ≥ 1   freq ≥ 2   freq ≥ 10
1          79         72          57
2        1896       1533         902
3       15970      10369        4525
4       64966      33941       10534
5      167543      69719       15538
6      323749     107861       18970
7      524634     142954       21093
8      760719     171995       22159
9      921361     193995       22696

Cumulative distribution of feature length for 5090 normalized English tweets (300 KB)
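A table like the one above can be produced with a short counting routine. This is a sketch under the assumption that each row counts the distinct character n-grams of length 1 up to n whose frequency meets the threshold; the function name `cumulative_ngram_counts` is illustrative, not taken from ldig.

```python
from collections import Counter


def cumulative_ngram_counts(text, max_n, min_freq=1):
    """Count distinct character n-grams of length 1..max_n that occur at least min_freq times."""
    total = 0
    for n in range(1, max_n + 1):
        # Frequency of every n-gram of exactly this length (overlaps included)
        counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
        total += sum(1 for freq in counts.values() if freq >= min_freq)
    return total


for n in (1, 2, 3):
    print(n, cumulative_ngram_counts("abracadabra", n), cumulative_ngram_counts("abracadabra", n, min_freq=2))
```

Enumerating every length this way costs time and memory quadratic in the text length, which is the problem the maximal substring model is meant to avoid.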

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features

– Maximal substrings give an equivalent model that can be constructed in linear time

– Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (a partial order) among the non-empty substrings of a text, e.g. "abracadabra"

– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is inside an occurrence of "bra"

– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

• Strictly speaking, the definition also takes into account the position within the containing substring

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring

• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
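The "abracadabra" example can be checked with a brute-force sketch. It assumes the following reading of the definition: a substring is maximal when extending it by one character on either side always lowers its frequency. It takes quadratic time and memory, unlike the linear-time suffix array construction used in practice, and `maximal_substrings` is an illustrative name.

```python
from collections import Counter


def maximal_substrings(text):
    """Return the maximal substrings of text by brute force (illustration only)."""
    # Frequency of every substring, counting overlapping occurrences
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # If some one-character extension is just as frequent, the longer
        # string always accompanies sub, so sub is not maximal.
        absorbed = any(
            counts.get(ch + sub, 0) == freq or counts.get(sub + ch, 0) == freq
            for ch in alphabet
        )
        if not absorbed:
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```

For instance "bra" occurs twice, but so does "abra", so "bra" falls into the class whose maximal element is "abra".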

Maximal Substring and Infinity-Gram

• The frequencies of substrings in the same containment class are always equal

• In a model that is a linear combination of features, the common feature value can be factored out, so the weights of such substrings can be merged into one

• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams

Although the equivalence does not hold exactly on the test set, we assume that it is approximated well by a sufficiently large training set

Extended Suffix Array

• An Extended Suffix Array (ESA) consists of

– SA = Suffix Array

– L = Longest Common Prefixes

– B = Burrows-Wheeler transformed text

• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to suffixes with L > 0 whose BWT has more than one character type

– They can be calculated in linear time

• esaxx: Okanohara's implementation of ESA

– http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character type to detect

– In short text detection, mixed text can be divided by type of characters

• Latin alphabet languages

– The most difficult alphabet type to detect

– More than 25 of these languages have over 5 million speakers

What's Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet

– å ą æ ð Ħ ŋ and so on

• They are assigned to 9 code blocks in Unicode

Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered with these
U+0100-017F   Latin Extended-A              (same as above)
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D              (same as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API

– It samples 1% of all tweets (about 2 million tweets)

– Tweets in Latin alphabet languages account for 60% of them

• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone

– French tweets account for about 1% of all tweets

– But they account for 50% of the tweets in the Paris timezone alone

– (20% of all tweets have no timezone)

• Annotate tentative labels to tweets using langdetect

– Remove non-French tweets from the ones labeled 'fr'

– Recover French tweets from the ones not labeled 'fr'

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]

Created Corpus

• Noise-free tweets as training data

• Noisy tweets with more than 3 words as test data

• Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan corpus

code  language     training    test
ca    Catalan          9089    5082
cs    Czech            9082    7682
da    Danish           7388    5524
de    German          44448   10065
en    English         44520   10168
es    Spanish         44118   10265
fi    Finnish          8087    7050
fr    French          44339   10098
hu    Hungarian       10030    4904
id    Indonesian      44722   10181
it    Italian         43366   10152
nl    Dutch           44682   10007
no    Norwegian       10124    8496
pl    Polish          16771   10152
pt    Portuguese      44215   10208
ro    Romanian        10021    5911
sv    Swedish         44054   10032
tr    Turkish         44703   10308
vi    Vietnamese      15030   10488
total                538789  166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the twitter corpus

– It still reaches at most 98% accuracy

• We guess it is necessary to reduce biases

– data size bias

– language-specific bias

– twitter-specific bias

Bias by Data Size

• The number of tweets varies hugely between languages

• Level them out by sampling with replacement from each language, up to the size of the largest language

– In practice, this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Pie chart: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
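The levelling step can be sketched as follows, assuming the corpus is held as a dictionary from language label to a non-empty list of tweets; the function name and the fixed random seed are illustrative choices, not part of ldig.

```python
import random


def balance_by_oversampling(corpus, seed=0):
    """Grow every language to the size of the largest one.

    Each language's data is copied a whole number of times, and the
    remainder is sampled without replacement, as described above.
    """
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus.values())
    balanced = {}
    for label, texts in corpus.items():
        copies, remainder = divmod(target, len(texts))
        balanced[label] = texts * copies + rng.sample(texts, remainder)
    return balanced


corpus = {
    "en": ["a", "b", "c", "d", "e"],
    "da": ["x", "y"],
}
balanced = balance_by_oversampling(corpus)
print({label: len(texts) for label, texts in balanced.items()})
```

Copying whole multiples first guarantees that every tweet of a small language appears almost equally often, which plain sampling with replacement only achieves on average.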

Convert to Lowercase on Multiple Languages

• Conversion to lower case reduces the corpus size needed and compresses the model

• But the lower case of I (U+0049) in Turkish differs from the other languages

• Convert to lower case, excluding 'I'

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
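A minimal sketch of this rule, assuming that only the capital letter I (U+0049) is left untouched and every other character goes through Python's default `str.lower`; the function name is illustrative.

```python
def lower_except_capital_i(text):
    """Lowercase text but keep 'I' (U+0049), whose lower case is language-dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is"))
print(lower_except_capital_i("ÇOK DOĞRU"))
```

Keeping 'I' as a separate symbol lets the model learn per language which contexts it appears in, instead of forcing a choice between "i" and "ı" before the language is known.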

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z

• There are 2 kinds of "s/t with a beard"

– U+015E-F, U+0162-3: s/t with cedilla

– U+0218-B: s/t with comma below

• 's/t with cedilla' is more popular in news, twitter and Wikipedia

• The two code points have the same glyph in some fonts

– Indistinguishable

ș (U+0219)   ş (U+015F)
ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently

– 1989: Democratization in Romania

– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode

– 2007: Romania joined the EU

– 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' is used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters

– But they are more popular than the correct ones

– with cedilla : with comma = 2 : 1

– "Romanian IME" outputs the substitutes too :D

• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ)
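The mapping above is a fixed character substitution, so it can be sketched with a translation table. The table below covers the four code points listed on the previous slides; the names are illustrative.

```python
# Map 's/t with comma below' to the more widespread 's/t with cedilla'
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Încântat de cuno\u0219tin\u021b\u0103"))
```

After this step both spellings of a Romanian word share the same features, so the training data is not split between two visually identical forms.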

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem

• The character 'yeh' in Farsi corresponds to 2 code points

– Wikipedia uses only ی (U+06cc, Farsi yeh)

– News uses only ي (U+064a, Arabic yeh)

• U+064a is a substitute in Farsi

– The popular Arabic charset CP-1256 has no character mapped to U+06cc

– As 'yeh' is used very often in both languages, detection fails for almost all such Persian text

• Regard U+06cc as U+064a

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels

– a ă â e ê i y o ô ơ u ư

• Vietnamese has 6 tones

– a ả à ã á ạ

– These tone symbols are also used in general documents like news

• The tone symbols can be appended to all vowels

– 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones

1. Use the precomposed characters U+1ea0 - U+1ef9

• ẵ = U+1eb5

2. Combine with Combining Diacritical Marks

• ẵ = U+0103 U+0303

– Both are used about half and half in news and tweets

• Normalize 2 into 1
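Converting representation 2 into representation 1 is what Unicode canonical composition does, so one way to sketch it is with the standard library's NFC normalization. Whether ldig uses NFC or its own mapping table is not stated in the slides, so treat this as an assumption.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining tone marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

The two input code points collapse into the single code point U+1EB5, so both spellings of a Vietnamese word produce identical substring features.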

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)

– Other character types have only 30-50 characters

• The character space is very sparse

– Characters that don't occur in the training corpus have no probabilities

• e.g. 谢谢, Kanji for person names

– Common frequent characters are too strong

• e.g. a text which contains "的" tends to be detected as Traditional Chinese

• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character

– (1) K-means clustering

• Use tf-idf on Wikipedia and Google News

• K = 50 (size of the ASCII alphabet = 52)

– (2) "Commonly Used Kanji" lists provided in Japanese and Chinese

• Simplified Chinese: 现代汉语常用字表 (3500)

• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)

• Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998

– 常用漢字 contains few Kanji for person names and place names

• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove

– URLs

– mentions

– hashtags

– RT

– emoticons made of letters, like XD and :p
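The removal step can be sketched with a few regular expressions. The patterns below are simplified assumptions rather than ldig's actual rules, and emoticon removal is left out because the exact emoticon list is not given in the slides.

```python
import re

# Simplified patterns (assumptions for illustration)
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags and the RT marker, then squeeze whitespace."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_markup("RT @bob check this http://t.co/abc #fun now"))
```

URLs are removed first so that a "#fragment" or "@" inside a link is not half-matched by the later patterns.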

Normalization for twitter-Specific Representation

• How to handle words like 'coooooooollllll'?

• Case 1: Make a normalization dictionary using [Brody+ 11]

– Unsupervised normalization like coooollll → cool

– It can't handle words that are not in the dictionary

• Case 2: If the same character repeats 3 or more times in a row, shrink the run to 2

– No language's orthography has 3 or more consecutive occurrences of the same Latin letter

• In Japanese there are "かたたたき", "かわいいいぬ", "あわててて" and so on

• Acronyms (like WWW, СССР) are not useful for language detection
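Case 2 is a single backreference substitution. This sketch assumes "3 or more repetitions" as the threshold and applies to any character; ldig's actual pattern may be restricted differently.

```python
import re

# A character followed by at least two more copies of itself
REPEATED = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

The result "cooll" is not a dictionary word, but that is fine for this purpose: all lengthened variants of "cool" now map to the same string and share features.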

Laugh Normalization

• Each language has various ways to write laughter

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi ) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages

– https://github.com/shuyo/ldig

• MIT license

• The trained model is also distributed here

– ∞-gram LR (maximal substring) [Okanohara+ 09]

– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

– Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]

– Extracts features from the corpus and initializes the model

– -m: model directory

– -x: path of the maximal substring extractor (executed as an external process)

– --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]

– Input is multiple-line text

• TABs are replaced with " " and line feeds with U+0001

– Output is "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

– Trains the model on the corpus for 1 cycle of SGD

– -e: learning rate of SGD

– -r: regularizer of L1 regularization

– --wr: how often to regularize the whole parameter set

• There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink

– Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]

– Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data

– [correct label]\t[meta data]\t[text] (fields are separated by TABs)

Examples:

en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
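Reading this format can be sketched as below. It assumes that the first TAB-separated field is the label, the last field is the text, and anything in between is metadata; this layout is inferred from the slide, and `parse_line` is an illustrative helper, not a function of ldig.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, metadata, text = fields[0], fields[1:-1], fields[-1]
    return label, metadata, text


sample = "ca\t[status ID]\t[datetime]\t[userID]\t[language of UI]\txxx xDDD you know xDD\n"
label, metadata, text = parse_line(sample)
print(label, len(metadata), text)
```

Taking the text from the last field keeps the reader tolerant of lines that carry a different number of metadata columns.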

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]

– Open http://localhost:[port] after it is started

– For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Estimation

LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size"

code  language     size   detect  correct  precision  recall  LD53  LDsm
ca    Catalan      5093     4923     4857      98.66   95.37  95.3  97.0
cs    Czech        7681     7668     7663      99.93   99.77  96.3  99.7
da    Danish       5516     5472     5310      97.04   96.27  94.5  92.4
de    German      10060    10069    10006      99.37   99.46  86.6  93.8
en    English     10162    10133    10029      98.97   98.69  88.3  95.0
es    Spanish     10244    10284    10120      98.41   98.79  91.5  96.0
fi    Finnish      7051     7038     7024      99.80   99.62  98.9  99.6
fr    French      10074    10134    10051      99.18   99.77  95.0  98.1
hu    Hungarian    4904     4892     4858      99.30   99.06  85.8  95.5
id    Indonesian  10178    10225    10160      99.36   99.82  89.7  98.9
it    Italian     10143    10205    10103      99.00   99.61  96.2  98.0
nl    Dutch       10005     9916     9858      99.42   98.53  69.5  97.4
no    Norwegian    8504     8432     8201      97.26   96.44  96.0  96.3
pl    Polish      10151    10149    10130      99.81   99.79  98.0  99.7
pt    Portuguese  10212    10201    10119      99.20   99.09  88.0  96.9
ro    Romanian     5913     5867     5850      99.71   98.93  92.8  97.4
sv    Swedish     10025    10093     9942      98.50   99.17  96.0  97.9
tr    Turkish     10308    10317    10298      99.82   99.90  97.6  99.5
vi    Vietnamese  10487    10480    10474      99.94   99.88  98.7  99.2
total            166711            165053      99.01 (correct / size)  92.2  97.4
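The precision and recall columns follow from the three counts in each row: precision is correct / detect and recall is correct / size. A short sketch that reproduces the Catalan row (the function name is illustrative):

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent for one language.

    size    -- number of test tweets whose true label is this language
    detect  -- number of tweets the detector assigned to this language
    correct -- number of those assignments that were right
    """
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return precision, recall


precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(f"{precision:.2f} {recall:.2f}")
```

Because low-confidence texts are left undetected, "detect" can be smaller than "size", which lowers recall without affecting precision.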

Estimation for LIGA dataset

• Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm/

• Using the 19-language model

code  language   size  detect  correct  precision  recall
de    German     1479    1476     1469       99.5    99.3
en    English    1505    1502     1490       99.2    99.0
es    Spanish    1562    1548     1541       99.6    98.7
fr    French     1551    1549     1540       99.4    99.3
it    Italian    1539    1531     1528       99.8    99.3
nl    Dutch      1430    1429     1424       99.7    99.6
total            9066             8992       99.2 (correct / size)

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports ("----" = not supported)

                            ldig          langdetect         CLD
code  language     size  correct  rate  correct  rate  correct  rate
bg    Bulgarian    1000     ----  ----      988  98.8      991  99.1
cs    Czech        1000     1000 100.0      994  99.4      995  99.5
da    Danish       1000      976  97.6      968  96.8      932  93.2
de    German       1000      999  99.9      998  99.8     1000 100.0
el    Greek        1000     ----  ----     1000 100.0     1000 100.0
en    English      1000      999  99.9      996  99.6     1000 100.0
es    Spanish      1000     1000 100.0      996  99.6      989  98.9
et    Estonian     1000     ----  ----      996  99.6      998  99.8
fi    Finnish      1000      997  99.7      998  99.8     1000 100.0
fr    French       1000      999  99.9      999  99.9      992  99.2
hu    Hungarian    1000     1000 100.0      999  99.9      999  99.9
it    Italian      1000      999  99.9      999  99.9      996  99.6
lt    Lithuanian   1000     ----  ----      997  99.7      999  99.9
lv    Latvian      1000     ----  ----      999  99.9      998  99.8
nl    Dutch        1000     1000 100.0      974  97.4      995  99.5
pl    Polish       1000      998  99.8      999  99.9      997  99.7
pt    Portuguese   1000      995  99.5      996  99.6      989  98.9
ro    Romanian     1000     1000 100.0      999  99.9      998  99.8
sk    Slovak       1000     ----  ----      988  98.8      990  99.0
sl    Slovene      1000     ----  ----      976  97.6      963  96.3
sv    Swedish      1000      995  99.5      991  99.1      993  99.3
total             21000    13957  99.7    20850  99.3    20814  99.1

Conclusions

• A language detector using the maximal substring model

– Detects 19 languages with over 99% accuracy

– Even langdetect reaches 97% accuracy when trained on the tweet corpus

• If the corpus is cleaned up further, the precision will improve further

– There are still many mistakes (in particular for da and no)

• If metadata is added to the features, the precision will improve further

– The open question is how to add and train metadata at low cost

• We want to shrink the model without loss of precision

– It is too large for applications (> 100 MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 5: Short Text Language Detection with Infinity-Gram

Hints

bull Dutch if there is ik

bull German if there is ich or a letter szlig

bull Polish if there is czy or letters Ł ń ś or ź

bull Scandinavian if there is a letter aring

ndash Danish if there is af Tak means thanks

ndash Norwegian if there is nei Takk means thanks

ndash Swedish if there is och Tack means thanks

bull Turkish if there is a letter ı ( i without point) or ğ

bull Romanian if there is a letter ă or ș or ț

ndash Although ă is also used in Vietnamese it is easy to distinguish them

ndash Although ş is also used in Turkish it is easy to distinguish them

bull Vietnamese if there are many unreadable letters on WinXP P

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 7

In What Language (Solution)

bull Ik kan er nooit tegen als mensen me negeren Dutch

bull Aha ich seh angeblich suumlszlig aus German

bull Czy moacutegłbym zasnąć w przedmieściach Twoich myśli Polish

bull Ah Tak Saring skal jeg bare finde ud af hvordan Danish

bull Det er ikke saring digg nei aring vi som har finale til helgaSkrekk og gru Takk ) Norwegian

bull tack kompis Hade faktiskt taumlnkt maila dig paring fb och fraringga vart du tog vaumlgen Swedish

bull Ccedilok doğru En buumlyuumlk hatayı yaptım Turkish

bull Icircncacircntat de cunoștință Rumanian

bull Một người dacircn bị thương vagrave bốn người mất tiacutech sau khi một ngọn nuacutei lửa ở miền trung Vietnamese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 8

Whats Language Detection

bull To detect what language the input text written in

ndash Time fries like arrow rarr English

ndash Buona sera rarr Italian

bull It is prior for many language processing tasks

ndash Language model is built for each language

ndash Text search classification extraction translation

bull It is possible to detect for long enough and noiseless text with more than 99 accuracy [Cavnar+ 94]

ndash 3-gram model is used in many methods

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 9

SPAM or not

bull It is necessary to know that it is written in Polish

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 10

Document Categorization with Naive Bayes Classifier

bull Categorize a document 119883 = (119883119894) into category 119862119896

ndash A document 119883 is represented as collection of words 119883119894 (bag-of-words)

bull Word probability assumes conditionally independent on each category

ndash 119901 119883 119862119896 = 119901 119883119894 119862k119894 (from independent hypothesis)

ndash where 119901(119883119894|119862) rate of word frequency for category

bull Estimate the category 119862k to maximize posterior

ndash 119901 119862k 119883 =119901 119883 119862k 119901 119862k

119901 119883prop 119901(119862k) 119901(119883119894|119862k)119894

ndash where 119901(119862k) prior for category

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 11

Language Detection with Naive Bayes Classifier

bull Document categorization with language

labels

ndash Categorize documents into English Japanese

and so on

bull Use character n-gram as features

ndash Unicode code point n-gram strictly speaking

ndash Assume character encoding of the document is

already known

bull Most applications know encoding of inside text data

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 12

Why Use n-Gram to Detect Language

bull Each language has proper characters and spelling rules

ndash ldquoeacuterdquo is often used in Spanish Italian and so on but not in English in principle

ndash There are many words which start with ldquoZrdquo in German but not in English

ndash There are many words which start with ldquoCrdquo in English but not in German

ndash Spelling ldquoThrdquo is often used in English but not in the other languages

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 13

T h i s T h i s larr1-gram

T Th hi is s larr2-gram

Th Thi his is larr3-gram

C L Z Th

English 075 047 002 074

German 010 037 053 003

French 038 069 001 001

language-detection(langdetect) (Nakatani 2010)

bull Language detection library for Java

ndash httpcodegooglecomplanguage-detection

ndash Apache License 20

ndash Character 3-gram + Bayesian filter

ndash Various normalizations + Feature sampling

bull 99 over precision for 53 languages

ndash Training with Wikipedia abstract

ndash Widely support including Asian languages

ndash Adopted by Apache Solr

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 14

Evaluation on News Text

• Test on news text crawled from the web in 49 languages

code   language             size  correct (accuracy)
af     Afrikaans             200  199 (99.50%)
ar     Arabic                200  200 (100.00%)
bg     Bulgarian             200  200 (100.00%)
bn     Bengali               200  200 (100.00%)
cs     Czech                 200  200 (100.00%)
da     Danish                200  179 (89.50%)
de     German                200  200 (100.00%)
el     Greek                 200  200 (100.00%)
en     English               200  200 (100.00%)
es     Spanish               200  200 (100.00%)
fa     Persian               200  200 (100.00%)
fi     Finnish               200  200 (100.00%)
fr     French                200  200 (100.00%)
gu     Gujarati              200  200 (100.00%)
he     Hebrew                200  200 (100.00%)
hi     Hindi                 200  200 (100.00%)
hr     Croatian              200  200 (100.00%)
hu     Hungarian             200  200 (100.00%)
id     Indonesian            200  200 (100.00%)
it     Italian               200  200 (100.00%)
ja     Japanese              200  200 (100.00%)
kn     Kannada               200  200 (100.00%)
ko     Korean                200  200 (100.00%)
mk     Macedonian            200  200 (100.00%)
ml     Malayalam             200  200 (100.00%)
mr     Marathi               200  200 (100.00%)
ne     Nepali                200  200 (100.00%)
nl     Dutch                 200  200 (100.00%)
no     Norwegian             200  199 (99.50%)
pa     Punjabi               200  200 (100.00%)
pl     Polish                200  200 (100.00%)
pt     Portuguese            200  200 (100.00%)
ro     Romanian              200  200 (100.00%)
ru     Russian               200  200 (100.00%)
sk     Slovak                200  200 (100.00%)
so     Somali                200  200 (100.00%)
sq     Albanian              200  200 (100.00%)
sv     Swedish               200  200 (100.00%)
sw     Swahili               200  200 (100.00%)
ta     Tamil                 200  200 (100.00%)
te     Telugu                200  200 (100.00%)
th     Thai                  200  200 (100.00%)
tl     Tagalog               200  200 (100.00%)
tr     Turkish               200  200 (100.00%)
uk     Ukrainian             200  200 (100.00%)
ur     Urdu                  200  200 (100.00%)
vi     Vietnamese            200  200 (100.00%)
zh-cn  Simplified Chinese    200  200 (100.00%)
zh-tw  Traditional Chinese   200  200 (100.00%)
total                       9800  9777 (99.77%)

Evaluation on Europarl Datasets

• Test on 1000 samples per language from the Europarl Parallel Corpus
  – Taken from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
    • http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

code  language    size   correct  accuracy
bg    Bulgarian   1000    988      98.8%
cs    Czech       1000    994      99.4%
da    Danish      1000    968      96.8%
de    German      1000    998      99.8%
el    Greek       1000   1000     100.0%
en    English     1000    996      99.6%
es    Spanish     1000    996      99.6%
et    Estonian    1000    996      99.6%
fi    Finnish     1000    998      99.8%
fr    French      1000    999      99.9%
hu    Hungarian   1000    999      99.9%
it    Italian     1000    999      99.9%
lt    Lithuanian  1000    997      99.7%
lv    Latvian     1000    999      99.9%
nl    Dutch       1000    974      97.4%
pl    Polish      1000    999      99.9%
pt    Portuguese  1000    996      99.6%
ro    Romanian    1000    999      99.9%
sk    Slovak      1000    988      98.8%
sl    Slovene     1000    976      97.6%
sv    Swedish     1000    991      99.1%
total            21000  20850      99.3%

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):
code  language     LD    CLD   Tika
ca    Catalan     95.3  93.0  83.8
cs    Czech       96.3  96.6  ----
da    Danish      94.5  90.7  58.7
de    German      86.6  96.8  73.1
en    English     88.3  97.4  54.7
es    Spanish     91.5  90.5  44.4
fi    Finnish     98.9  99.4  94.8
fr    French      95.0  94.5  67.4
hu    Hungarian   85.8  89.0  76.2
id    Indonesian  89.7  92.8  ----
it    Italian     96.2  93.8  87.1
nl    Dutch       69.5  93.2  65.0
no    Norwegian   96.0  74.9  68.6
pl    Polish      98.0  97.8  88.8
pt    Portuguese  88.0  88.6  47.4
ro    Romanian    92.8  96.1  82.6
sv    Swedish     96.0  96.4  75.6
tr    Turkish     97.6  97.4  ----
vi    Vietnamese  98.7  98.9  ----
total             92.2  93.8  70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on the Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is Twitter Language Detection Difficult? (1)

• A tweet is too short to yield enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions, and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is Twitter Language Detection Difficult? (2)

• Tweets are too noisy
  – Spellings that violate the language's orthography appear often
  – Acronyms, abbreviations, lengthened words (like “Cooooolll”)
• The likelihood of a tweet tends to be low under a normal language model

Examples:
  OMG   Oh My God
  LOL   Laughing Out Loud
  LMAO  Laughing My Ass Off
  F4F   Follow for Follow
  MDR   Mort de Rire (French)
  TKT   Ne t'inquiète pas (French)
  u     you
  ur    your
  4     for
  i0u   I love you
  k     che (Italian)
  anke  anche (Italian)
The letter “k” isn't used in Italian.

Motivation for Short Text Language Detection

• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards, and so on
  – The issue tracker of the langdetect project has many questions about short text detection
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed multi-language text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal

• Over 99% accuracy
  – However, detecting a one-word sentence is too difficult
  – Our goal is 99+% accurate detection for sentences with more than 3 words

We Need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels

Proposed Method

How to Increase Features beyond 3-Grams

• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But that has O(T^2) order

Number of n-gram features up to length n:
  n   freq ≥ 1   freq ≥ 2   freq ≥ 10
  1        79         72          57
  2      1896       1533         902
  3     15970      10369        4525
  4     64966      33941       10534
  5    167543      69719       15538
  6    323749     107861       18970
  7    524634     142954       21093
  8    760719     171995       22159
  9    921361     193995       22696
Cumulative distribution of feature length for 5090 normalized English tweets (300 KB)
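A table of this shape can be produced with a simple counter. The sketch below assumes that "cumulative" means the number of distinct substrings of length at most n whose frequency reaches a threshold; the function name `cumulative_ngram_counts` is hypothetical.

```python
from collections import Counter

def cumulative_ngram_counts(text, max_n, min_freq=1):
    """For n = 1..max_n, count distinct substrings of length <= n
    that occur at least min_freq times in text."""
    counts = Counter()
    result = []
    for n in range(1, max_n + 1):
        # Add all n-grams of exactly this length, then count survivors so far.
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        result.append(sum(1 for freq in counts.values() if freq >= min_freq))
    return result
```

The numbers grow quickly with n when every substring is kept, which is why enumerating all substrings directly is impractical for a large corpus.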

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (a partial order) among non-empty substrings
  Example text: abracadabra
  – “ra” ⊂ “bra” ⇔ every occurrence of “ra” is part of an occurrence of “bra”
  – “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”
  Strictly speaking, the definition also takes the position within the containing substring into account.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of “abracadabra” are “a”, “abra”, and “abracadabra”

(Figure via http://d.hatena.ne.jp/nokuno/20120203/1328237067)
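The “abracadabra” example can be checked with a brute-force sketch. It uses the characterization that a substring is maximal when no one-character extension to the left or right occurs equally often. This O(T^2) enumeration is for illustration only and is not the linear-time suffix-array method the talk relies on; the function name `maximal_substrings` is hypothetical.

```python
from collections import Counter

def maximal_substrings(text):
    """Return substrings whose frequency drops whenever they are
    extended by one character on either side (brute force)."""
    # Count every occurrence of every substring, including overlaps.
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for s, freq in counts.items():
        same_freq_extension = any(
            counts.get(c + s, 0) == freq or counts.get(s + c, 0) == freq
            for c in alphabet
        )
        if not same_freq_extension:
            result.append(s)
    return sorted(result, key=len)
```

For instance, “bra” is rejected because “abra” occurs exactly as often (twice), while “a” is kept because every extension such as “ab” or “ra” occurs less often than “a” itself.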

Maximal Substring and Infinity-Gram

• Substrings related by containment (in the same equivalence class) always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams

Note: the equivalence breaks down on the test set, but we assume it can be approximated with a sufficiently large training set.
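The merging step can be written out explicitly. As a sketch, assume substrings $s_1, \dots, s_m$ belong to the same equivalence class, so that their counts in any training text $x$ coincide, $f_{s_1}(x) = \dots = f_{s_m}(x) = f(x)$ (the symbols $w_{s_j}$ and $f_{s_j}$ are notation introduced here for the weights and feature values):

```latex
\[
\sum_{j=1}^{m} w_{s_j}\, f_{s_j}(x)
  \;=\; \Bigl(\sum_{j=1}^{m} w_{s_j}\Bigr) f(x)
  \;=\; w_{\max}\, f(x),
\qquad w_{\max} := \sum_{j=1}^{m} w_{s_j}.
\]
```

A single weight $w_{\max}$ attached to the maximal substring of the class therefore reproduces the contribution of all substrings in that class to the linear score.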

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = suffix array
  – L = longest common prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT range contains more than one distinct character
  – These can be computed in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

(Figure via [Okanohara+ 09])

Corpus and Normalization

Target Languages

• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin-alphabet languages
  – The most difficult alphabet type to detect
  – More than 25 of these languages have over 5 million speakers

What Is the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å, ą, æ, ð, Ħ, ŋ, and so on
• Latin letters are assigned to 9 code blocks in Unicode

Range         Name                         Notes
U+0000-007F   Basic Latin                  ASCII
U+0080-00FF   Latin-1 Supplement           Most languages are covered by these two blocks
U+0100-017F   Latin Extended-A             (see above)
U+0180-024F   Latin Extended-B             Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for composing tone symbols
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             Used by almost no present-day language
U+A720-A7FF   Latin Extended-D             (see above)

Latin Alphabets in the Unicode Code Point Chart

[Figure: Unicode code point chart of Latin letters. Legend: used often / used sometimes / used for Vietnamese only]

How to Create the Corpus

• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’

How to Annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian, and Polish language guides, in turn]

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

code  language    training    test
ca    Catalan         9089    5082
cs    Czech           9082    7682
da    Danish          7388    5524
de    German         44448   10065
en    English        44520   10168
es    Spanish        44118   10265
fi    Finnish         8087    7050
fr    French         44339   10098
hu    Hungarian      10030    4904
id    Indonesian     44722   10181
it    Italian        43366   10152
nl    Dutch          44682   10007
no    Norwegian      10124    8496
pl    Polish         16771   10152
pt    Portuguese     44215   10208
ro    Romanian       10021    5911
sv    Swedish        44054   10032
tr    Turkish        44703   10308
vi    Vietnamese     15030   10488
total               538789  166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – It still reaches at most 98% accuracy
• We guess that it is necessary to reduce bias
  – Data size bias
  – Language-specific bias
  – Twitter-specific bias

Bias by Data Size

• The number of tweets per language is hugely imbalanced
• Level them out by sampling with replacement from each language, up to the size of the largest dataset
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Pie chart: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
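The approximation described above (whole copies plus a remainder sampled without replacement) might look like the following sketch. The function name `level_out` and the fixed default seed are assumptions for illustration, not taken from ldig.

```python
import random

def level_out(samples, target_size, rng=None):
    """Upsample a list to target_size: repeat it a whole number of times,
    then draw the remainder without replacement."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    copies, remainder = divmod(target_size, len(samples))
    return samples * copies + rng.sample(samples, remainder)
```

Every original sample then appears either `copies` or `copies + 1` times, so no sample is dropped, which plain sampling with replacement would not guarantee.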

Converting to Lowercase across Multiple Languages

• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• So convert to lower case, excluding ‘I’

Language              Upper case    Lower case
Turkish, Azerbaijani  I (U+0049)    ı (U+0131)
Turkish, Azerbaijani  İ (U+0130)    i (U+0069)
Others                I (U+0049)    i (U+0069)
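A minimal sketch of "lowercase everything except ‘I’" is shown below. The function name is hypothetical, and how İ (U+0130) should be handled is left open, since the slide only excludes ‘I’.

```python
def lower_except_capital_i(text):
    """Lowercase every character except the ASCII capital 'I' (U+0049),
    whose lowercase form depends on the language."""
    # Note: Python lowercases 'İ' (U+0130) to 'i' followed by U+0307;
    # this sketch does not treat that case specially.
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

For example, `lower_except_capital_i("ISTANBUL Is")` gives `"Istanbul Is"`, leaving the decision between “i” and “ı” to the model.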

Normalization for Romanian

• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are two kinds of s and t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more common in news, on Twitter, and on Wikipedia
• The two code points have the same glyph design in some fonts
  – Indistinguishable

  ș (U+0219)  ş (U+015F)
  ț (U+021B)  ţ (U+0163)

Romanian Characters on PCs

• Although Romanian orthography requires ‘s/t with comma’, these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)

[Photo: ‘s/t with cedilla’ used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters
  – But they are more common than the correct ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
  – ț (U+021B) → ţ (U+0163)

I reckon this is similar to the relationship between the two glyph forms of the Japanese character ‘SA’ (さ).
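Mapping ‘s/t with comma below’ onto ‘s/t with cedilla’ is a plain character translation. The sketch below covers the four code points U+0218-B named earlier; the table and function names are hypothetical.

```python
# Comma-below forms mapped to the (more common) cedilla forms.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})

def normalize_romanian(text):
    """Replace s/t with comma below by s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)
```

Applying the same function to both training and test text keeps the two spellings from splitting one feature into two.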

Arabic Character Normalization (in language-detection)

• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – Because ‘yeh’ is very frequent in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents such as news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Two representations of vowels with tones
  1. Use the precomposed characters U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Combining Diacritical Marks
     • ẵ = U+0103 U+0303
  – News and tweets use the two about half and half
• Normalize 2 into 1
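One way to fold representation 2 into representation 1 is Unicode NFC normalization from Python's standard library. This is a swapped-in standard technique used here for illustration; the talk does not say that ldig itself relies on NFC, and the function name is hypothetical.

```python
import unicodedata

def to_precomposed(text):
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)
```

With this, the two-code-point sequence U+0103 U+0303 becomes the single code point U+1EB5, so both spellings of ẵ yield the same feature.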

CJK-Kanji Normalization (1) (in language-detection)

• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing “的” tends to be detected as Traditional Chinese
• Because Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (in language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) “Commonly used Kanji” lists published in Japan and China
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
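The removal step can be sketched with one regular expression. The pattern below is an assumption for illustration (it is not ldig's actual pattern) and deliberately leaves out emoticons, which need a list of their own.

```python
import re

# URLs, @mentions (with an optional trailing colon), #hashtags, and the RT marker.
_NOISE = re.compile(r"https?://\S+|@\w+:?|#\w+|\bRT\b")

def strip_twitter_noise(text):
    """Remove Twitter-specific tokens and collapse the leftover whitespace."""
    return " ".join(_NOISE.sub(" ", text).split())
```

For example, `strip_twitter_noise("RT @bob: check http://t.co/abc #cool stuff")` leaves only `"check stuff"` for the detector.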

Normalization for Twitter-Specific Representations

• How should lengthened words like ‘coooooooollllll’ be handled?
• Option 1: build a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization, like coooollll → cool
  – It can't handle words that are not in the dictionary
• Option 2: if the same character is repeated 3 or more times in a row, shrink it to 2
  – No language has an orthography in which the same Latin letter occurs 3 or more times in a row
    • Japanese, by contrast, has “かたたたき”, “かわいいいぬ”, “あわててて”, and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
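Option 2 fits in a single regular expression. The sketch below assumes that "3 or more repeats" is the intended threshold and applies the rule to any letter; the function name is hypothetical.

```python
import re

# A letter (no digits or underscore) followed by two or more copies of itself.
_REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")

def shrink_repeats(text):
    """Shrink runs of 3 or more identical letters down to 2."""
    return _REPEATED_LETTER.sub(r"\1\1", text)
```

Note that the result of `shrink_repeats("coooooooollllll")` is `"cooll"`, not `"cool"`: the rule only bounds run length and does not recover the dictionary spelling.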

Laugh Normalization

• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
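A simplified sketch of laugh shrinking is given below. It assumes lowercased input and only handles regular repetitions of one consonant-vowel syllable (such as "jajaja" or "kekeke"); irregular spellings like "hahahha" would need a looser pattern. The pattern and function name are assumptions, not ldig's code.

```python
import re

# A syllable such as "ha", "je" or "ki" repeated three or more times.
_LAUGH = re.compile(r"([hjk][aeiou])\1{2,}")

def shrink_laugh(text):
    """Shrink a regular laugh like 'jajajaja' to two syllables ('jaja')."""
    return _LAUGH.sub(r"\1\1", text)
```

The shortened form still carries the language cue (“jaja” for Spanish, “keke” for Vietnamese) without letting long laughs dominate the features.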

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs are replaced with “ ” and line feeds with U+0001
  – Output is “[feature]\t[frequency]”

Usage (2) Learning

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: number of times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrinking the Model

• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detecting Language

• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

Examples (fields are separated by TABs):
  en  u should just enjoy ur vacation sadly
  en  D im online but you arent RT that much
  en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
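A reader for this format can be sketched as follows. It assumes the first TAB-separated field is the label, the last is the text, and anything in between is metadata; the function name `parse_line` is hypothetical, as are the metadata values in the usage example.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata_fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, metadata, text = fields[0], fields[1:-1], fields[-1]
    return label, metadata, text
```

For a line without metadata such as `"en\tu should just enjoy ur vacation"`, the metadata list is simply empty.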

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after starting it
  – For text entered in the text area, it shows the language probabilities, the features the text contains, and their parameters

Evaluation

LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus.
Because a text whose maximum probability is < 0.6 is treated as undetectable, the total of “detect” is less than the total of “size”.

code  language     size    detect  correct  precision  recall  LD53  LDsm
ca    Catalan      5093     4923     4857     98.66    95.37   95.3  97.0
cs    Czech        7681     7668     7663     99.93    99.77   96.3  99.7
da    Danish       5516     5472     5310     97.04    96.27   94.5  92.4
de    German      10060    10069    10006     99.37    99.46   86.6  93.8
en    English     10162    10133    10029     98.97    98.69   88.3  95.0
es    Spanish     10244    10284    10120     98.41    98.79   91.5  96.0
fi    Finnish      7051     7038     7024     99.80    99.62   98.9  99.6
fr    French      10074    10134    10051     99.18    99.77   95.0  98.1
hu    Hungarian    4904     4892     4858     99.30    99.06   85.8  95.5
id    Indonesian  10178    10225    10160     99.36    99.82   89.7  98.9
it    Italian     10143    10205    10103     99.00    99.61   96.2  98.0
nl    Dutch       10005     9916     9858     99.42    98.53   69.5  97.4
no    Norwegian    8504     8432     8201     97.26    96.44   96.0  96.3
pl    Polish      10151    10149    10130     99.81    99.79   98.0  99.7
pt    Portuguese  10212    10201    10119     99.20    99.09   88.0  96.9
ro    Romanian     5913     5867     5850     99.71    98.93   92.8  97.4
sv    Swedish     10025    10093     9942     98.50    99.17   96.0  97.9
tr    Turkish     10308    10317    10298     99.82    99.90   97.6  99.5
vi    Vietnamese  10487    10480    10474     99.94    99.88   98.7  99.2
total            166711             165053    99.01 (correct / size)  92.2  97.4
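The precision and recall columns follow directly from the three counts. A minimal sketch (the function name is hypothetical) that reproduces the Catalan row:

```python
def precision_recall(size, detect, correct):
    """precision = correct / detect, recall = correct / size, both in percent."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)

# Catalan row: size 5093, detect 4923, correct 4857
catalan = precision_recall(5093, 4923, 4857)
```

Here “detect” counts texts assigned to the language, so precision is measured against what was detected and recall against what was actually in the test set.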

Evaluation on the LIGA Dataset

• Evaluated on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

code  language  size  detect  correct  precision  recall
de    German    1479   1476    1469      99.5     99.3
en    English   1505   1502    1490      99.2     99.0
es    Spanish   1562   1548    1541      99.6     98.7
fr    French    1551   1549    1540      99.4     99.3
it    Italian   1539   1531    1528      99.8     99.3
nl    Dutch     1430   1429    1424      99.7     99.6
total           9066           8992      99.2 (correct / size)

Evaluation on the Europarl Dataset

ldig is evaluated only on the languages it supports.

                         ldig           langdetect     CLD
code  language    size   correct rate   correct rate   correct rate
bg    Bulgarian   1000      -     -       988   98.8     991   99.1
cs    Czech       1000   1000  100.0      994   99.4     995   99.5
da    Danish      1000    976   97.6      968   96.8     932   93.2
de    German      1000    999   99.9      998   99.8    1000  100.0
el    Greek       1000      -     -      1000  100.0    1000  100.0
en    English     1000    999   99.9      996   99.6    1000  100.0
es    Spanish     1000   1000  100.0      996   99.6     989   98.9
et    Estonian    1000      -     -       996   99.6     998   99.8
fi    Finnish     1000    997   99.7      998   99.8    1000  100.0
fr    French      1000    999   99.9      999   99.9     992   99.2
hu    Hungarian   1000   1000  100.0      999   99.9     999   99.9
it    Italian     1000    999   99.9      999   99.9     996   99.6
lt    Lithuanian  1000      -     -       997   99.7     999   99.9
lv    Latvian     1000      -     -       999   99.9     998   99.8
nl    Dutch       1000   1000  100.0      974   97.4     995   99.5
pl    Polish      1000    998   99.8      999   99.9     997   99.7
pt    Portuguese  1000    995   99.5      996   99.6     989   98.9
ro    Romanian    1000   1000  100.0      999   99.9     998   99.8
sk    Slovak      1000      -     -       988   98.8     990   99.0
sl    Slovene     1000      -     -       976   97.6     963   96.3
sv    Swedish     1000    995   99.5      991   99.1     993   99.3
total            21000  13957   99.7    20850   99.3   20814   99.1

Conclusions

• A language detector using the maximal substring model
  – Over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• Cleaning up the corpus should raise precision further
  – There are still many mistakes (in particular between da and no)
• Adding metadata to the features should raise precision further
  – Open question: how to add and train on metadata at low cost
• We want to shrink the model without losing precision
  – It is too large for applications (> 100 MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

bull Estimate the category 119862k to maximize posterior

ndash 119901 119862k 119883 =119901 119883 119862k 119901 119862k

119901 119883prop 119901(119862k) 119901(119883119894|119862k)119894

ndash where 119901(119862k) prior for category

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 11

Language Detection with Naive Bayes Classifier

bull Document categorization with language

labels

ndash Categorize documents into English Japanese

and so on

bull Use character n-gram as features

ndash Unicode code point n-gram strictly speaking

ndash Assume character encoding of the document is

already known

bull Most applications know encoding of inside text data

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 12

Why Use n-Gram to Detect Language

bull Each language has proper characters and spelling rules

ndash ldquoeacuterdquo is often used in Spanish Italian and so on but not in English in principle

ndash There are many words which start with ldquoZrdquo in German but not in English

ndash There are many words which start with ldquoCrdquo in English but not in German

ndash Spelling ldquoThrdquo is often used in English but not in the other languages

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 13

T h i s T h i s larr1-gram

T Th hi is s larr2-gram

Th Thi his is larr3-gram

C L Z Th

English 075 047 002 074

German 010 037 053 003

French 038 069 001 001

language-detection(langdetect) (Nakatani 2010)

bull Language detection library for Java

ndash httpcodegooglecomplanguage-detection

ndash Apache License 20

ndash Character 3-gram + Bayesian filter

ndash Various normalizations + Feature sampling

bull 99 over precision for 53 languages

ndash Training with Wikipedia abstract

ndash Widely support including Asian languages

ndash Adopted by Apache Solr

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 14

Estimation with News Text

bull Test for crawled news text from web in 49 languages Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 15

Language size accuracyaf Afrikaans 200 199 (9950)ar Arabic 200 200 (10000)bg Bulgarian 200 200 (10000)bn Bengali 200 200 (10000)cs Czech 200 200 (10000)da Dannish 200 179 (8950)de German 200 200 (10000)el Greek 200 200 (10000)en English 200 200 (10000)es Spanish 200 200 (10000)fa Persian 200 200 (10000)fi Finnish 200 200 (10000)fr French 200 200 (10000)gu Gujarati 200 200 (10000)he Hebrew 200 200 (10000)hi Hindi 200 200 (10000)hr Croatian 200 200 (10000)hu Hungarian 200 200 (10000)id Indonesian 200 200 (10000)it Italian 200 200 (10000)ja Japanese 200 200 (10000)kn Kannada 200 200 (10000)ko Korean 200 200 (10000)mk Macedonian 200 200 (10000)ml Malayalam 200 200 (10000)

Language size accuracymr Marathi 200 200 (10000)ne Nepali 200 200 (10000)nl Dutch 200 200 (10000)no Norwegian 200 199 (9950)pa Punjabi 200 200 (10000)pl Polish 200 200 (10000)pt Portuguese 200 200 (10000)ro Romanian 200 200 (10000)ru Russian 200 200 (10000)sk Slovak 200 200 (10000)so Somali 200 200 (10000)sq Albanian 200 200 (10000)sv Swedish 200 200 (10000)sw Swahili 200 200 (10000)ta Tamil 200 200 (10000)te Telugu 200 200 (10000)th Thai 200 200 (10000)tl Tagalog 200 200 (10000)tr Turkish 200 200 (10000)uk Ukrainian 200 200 (10000)ur Urdu 200 200 (10000)vi Vietnamese 200 200 (10000)

zh-cn Simplified Chinese 200 200 (10000)zh-tw Traditional Chinese 200 200 (10000)

total 9800 9777 (9977)

Estimation with Europarl datasets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 16

bull Test for 1000 samples for each

language from Europarl Parallel Corpus

ndash from the proceedings of the European Parliament

ndash httpwwwstatmtorgeuroparl

bull httpcodegooglecomplanguage-

detectiondownloadsdetailname=eur

oparl-testzip

language size correct accuracybg Bulgarian 1000 988 988cs Czech 1000 994 994da Dannish 1000 968 968de German 1000 998 998el Greek 1000 1000 1000en English 1000 996 996es Spanish 1000 996 996et Estonian 1000 996 996fi Finnish 1000 998 998fr French 1000 999 999hu Hungarian 1000 999 999it Italian 1000 999 999lt Lithuanian 1000 997 997lv Latvian 1000 999 999nl Dutch 1000 974 974pl Polish 1000 999 999pt Portuguese 1000 996 996ro Romanian 1000 999 999sk Slovak 1000 988 988sl Slovene 1000 976 976sv Swedish 1000 991 991

total 21000 20850 993

Language Detection has been over isnt it

17

We still have ENEMY to beat

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 18

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

language           LD    CLD   Tika
ca  Catalan      95.3   93.0   83.8
cs  Czech        96.3   96.6   ----
da  Danish       94.5   90.7   58.7
de  German       86.6   96.8   73.1
en  English      88.3   97.4   54.7
es  Spanish      91.5   90.5   44.4
fi  Finnish      98.9   99.4   94.8
fr  French       95.0   94.5   67.4
hu  Hungarian    85.8   89.0   76.2
id  Indonesian   89.7   92.8   ----
it  Italian      96.2   93.8   87.1
nl  Dutch        69.5   93.2   65.0
no  Norwegian    96.0   74.9   68.6
pl  Polish       98.0   97.8   88.8
pt  Portuguese   88.0   88.6   47.4
ro  Romanian     92.8   96.1   82.6
sv  Swedish      96.0   96.4   75.6
tr  Turkish      97.6   97.4   ----
vi  Vietnamese   98.7   98.9   ----
total            92.2   93.8   70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is Twitter Language Detection Difficult? (1)

• A tweet is too short to yield enough 3-gram features
  – At most 140 characters on Twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for Twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is Twitter Language Detection Difficult? (2)

• Tweets are too noisy
  – Spellings that violate the language's orthography appear often
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be low under a normal language model

OMG: Oh My God
LOL: Laughing Out Loud
LMAO: Laughing My Ass Off
F4F: Follow for Follow
MDR: Mort de Rire (French)
TKT: Ne t'inquiète pas (French)
u: you
ur: your
4: for
i0u: I love you
k: che (Italian)
anke: anche (Italian)
(The letter k isn't used in standard Italian)

Motivation to Detect Short Text Language

• There are many small chunks of text besides Twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed multi-language text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal

• Over 99% accuracy
  – However, a one-word sentence is too difficult to detect
  – Our goal is 99%+ accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels

Proposed Method

How to increase features from 3-grams?

• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of substrings is of order O(T^2)

Number of n-gram features (cumulative up to length n)
n    freq≥1   freq≥2   freq≥10
1        79       72        57
2      1896     1533       902
3     15970    10369      4525
4     64966    33941     10534
5    167543    69719     15538
6    323749   107861     18970
7    524634   142954     21093
8    760719   171995     22159
9    921361   193995     22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (semi-order) among non-empty substrings, for example in "abracadabra":
  – “ra” ⊂ “bra” ⇔ every occurrence of “ra” is part of an occurrence of “bra”
  – “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”
  – Strictly speaking, the relation is defined together with the position inside the containing substring

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
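The "abracadabra" example can be checked with a brute-force sketch. It is not the linear-time construction of [Okanohara+ 09]; it assumes the usual reading of the definition, namely that a substring is maximal when it cannot be extended to the left or to the right without losing occurrences. The function name `maximal_substrings` is illustrative only.

```python
def maximal_substrings(text):
    """Brute-force maximal substrings (roughly cubic time, for illustration only)."""
    n = len(text)
    result = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = text[i:j]
            starts = [k for k in range(n - len(s) + 1) if text.startswith(s, k)]
            # Characters adjacent to each occurrence; None marks a text boundary.
            left = {text[k - 1] if k > 0 else None for k in starts}
            right = {text[k + len(s)] if k + len(s) < n else None for k in starts}
            # s is extendable on one side only if every occurrence has the
            # same real character on that side.
            left_maximal = len(left) > 1 or None in left
            right_maximal = len(right) > 1 or None in right
            if left_maximal and right_maximal:
                result.add(s)
    return result


print(sorted(maximal_substrings("abracadabra"), key=len))
```

For instance, "bra" is rejected because both of its occurrences are preceded by "a", so it always appears inside "abra".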

Maximal Substring and Infinity-Gram

• Substrings in a containment relationship always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams

Although the equivalence breaks down on the test set, we assume that it holds approximately when the training set is sufficiently large

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than one character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character type (script) to detect
  – In short text detection, mixed text can be divided by type of characters
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – More than 25 of these languages have over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å, ą, æ, ð, Ħ, ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                         Supplement
U+0000-007F   Basic Latin                  ASCII
U+0080-00FF   Latin-1 Supplement           Most languages are covered with these
U+0100-017F   Latin Extended-A             (same as above)
U+0180-024F   Latin Extended-B             Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D             (same as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code point chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from the ones labeled ‘fr’
  – Recover French tweets from the ones not labeled ‘fr’

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan corpus

language         training    test
ca  Catalan          9089    5082
cs  Czech            9082    7682
da  Danish           7388    5524
de  German          44448   10065
en  English         44520   10168
es  Spanish         44118   10265
fi  Finnish          8087    7050
fr  French          44339   10098
hu  Hungarian       10030    4904
id  Indonesian      44722   10181
it  Italian         43366   10152
nl  Dutch           44682   10007
no  Norwegian       10124    8496
pl  Polish          16771   10152
pt  Portuguese      44215   10208
ro  Romanian        10021    5911
sv  Swedish         44054   10032
tr  Turkish         44703   10308
vi  Vietnamese      15030   10488
total              538789  166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the Twitter corpus
  – It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – Twitter-specific bias

Bias by Data Size

• The number of tweets per language is hugely biased
• Level them out by sampling with replacement from each language, up to the size of the largest data set
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Pie chart: tweet share by language. Labels: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
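The approximation described above (whole copies plus a remainder drawn without replacement) can be sketched as follows. This is a minimal illustration, not the ldig code; `level_out` and its parameters are hypothetical names.

```python
import random


def level_out(samples, target_size, rng=None):
    """Grow `samples` to `target_size` items: whole copies of the data,
    plus a remainder sampled without replacement."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    copies, rest = divmod(target_size, len(samples))
    return samples * copies + rng.sample(samples, rest)


danish = ["tweet-%d" % i for i in range(3)]
print(level_out(danish, 8))  # every tweet appears 2 or 3 times
```

Compared with plain sampling with replacement, every original tweet is guaranteed to appear, and the counts differ by at most one.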

Convert to Lowercase on Multiple Languages

• Conversion to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• Convert to lower case, excluding ‘I’

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
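A minimal sketch of "lower case everything except ‘I’" is shown here. The function name is an assumption for illustration, and the actual ldig normalizer may treat further characters specially.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), whose lower case
    depends on the language (i in most languages, dotless ı in Turkish)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Ve ANKARA"))
```

Leaving ‘I’ untouched keeps the Turkish and non-Turkish cases from being merged into a single wrong letter before the language is known.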

Normalization for Romanian

• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 character types for s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular in news, Twitter and Wikipedia
• The two code points have the same design in some fonts
  – Indistinguishable

ș (U+0219)   ş (U+015F)
ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography stipulates that ‘s/t with comma’ must be used, these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)

[Photo: ‘s/t with cedilla’ used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters
  – But they are more popular than the official ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph variants of the Japanese character ‘SA’ (さ)
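Mapping ‘s/t with comma’ onto ‘s/t with cedilla’ amounts to a four-entry translation table. The sketch below uses Python's `str.translate`; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are illustrative and not taken from ldig.

```python
# s/t with comma below -> s/t with cedilla (the more common substitute forms)
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Fold the comma-below letters into their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021bar\u0103"))
```

Folding in this direction matches the more frequent spelling in the corpus, so fewer training characters have to be rewritten.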

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Two representations of vowels with tones
  1. Use the precomposed characters U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Each is used about half of the time in news and tweets
• Normalize 2 into 1
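One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard `unicodedata` module. This is an assumption about how the step could be implemented; the slides do not say which mechanism ldig uses.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining tone marks into precomposed letters."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = normalize_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

NFC also composes characters of other languages, which is harmless here because the goal is one canonical form per letter.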

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji used in personal names
  – Common frequent characters are too strong
    • e.g. a text which contains “的” tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for personal names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
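The removals listed above can be sketched with a few regular expressions. The patterns are deliberately simple assumptions for illustration (for example, the emoticon pattern covers only a handful of letter-based faces) and are not the patterns used in ldig.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDD, :p, :D (illustrative subset only)
FACE = re.compile(r"(?<!\w)[:;=xX][-']?[dDpP)(]+(?!\w)")


def clean_tweet(text):
    """Strip Twitter-specific tokens, then collapse the leftover whitespace."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("RT @bob so cool http://t.co/abc #fun XD"))
```

URLs are removed first so that fragments inside them are not mistaken for mentions or hashtags.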

Normalization for Twitter-Specific Representation

• How to handle words like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink the run to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • Japanese does: “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
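Case 2 fits in a single regular expression with a backreference. This is a sketch under the reading "3 or more repeats become 2"; the name `shrink_repeats` is illustrative.

```python
import re

# One character followed by two or more copies of itself
REPEAT = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3+ identical characters to exactly 2."""
    return REPEAT.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Unlike the dictionary approach of Case 1, this needs no word list, at the price of not restoring the standard spelling.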

Laugh Normalization

• Each language has its own ways to write laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: features with a frequency below the specified value are ignored

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with “ ” and line feeds with U+0001
  – Output is “[feature]\t[frequency]”

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learns the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how many times to regularize the whole parameter set
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca  [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it is started
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of "detect" is less than the sum of "size"

language          size  detect  correct  precision  recall  LD53  LDsm
ca  Catalan       5093    4923     4857      98.66   95.37  95.3  97.0
cs  Czech         7681    7668     7663      99.93   99.77  96.3  99.7
da  Danish        5516    5472     5310      97.04   96.27  94.5  92.4
de  German       10060   10069    10006      99.37   99.46  86.6  93.8
en  English      10162   10133    10029      98.97   98.69  88.3  95.0
es  Spanish      10244   10284    10120      98.41   98.79  91.5  96.0
fi  Finnish       7051    7038     7024      99.80   99.62  98.9  99.6
fr  French       10074   10134    10051      99.18   99.77  95.0  98.1
hu  Hungarian     4904    4892     4858      99.30   99.06  85.8  95.5
id  Indonesian   10178   10225    10160      99.36   99.82  89.7  98.9
it  Italian      10143   10205    10103      99.00   99.61  96.2  98.0
nl  Dutch        10005    9916     9858      99.42   98.53  69.5  97.4
no  Norwegian     8504    8432     8201      97.26   96.44  96.0  96.3
pl  Polish       10151   10149    10130      99.81   99.79  98.0  99.7
pt  Portuguese   10212   10201    10119      99.20   99.09  88.0  96.9
ro  Romanian      5913    5867     5850      99.71   98.93  92.8  97.4
sv  Swedish      10025   10093     9942      98.50   99.17  96.0  97.9
tr  Turkish      10308   10317    10298      99.82   99.90  97.6  99.5
vi  Vietnamese   10487   10480    10474      99.94   99.88  98.7  99.2
total           166711       -   165053          -   99.01  92.2  97.4
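The precision and recall columns follow from the three count columns: precision = correct / detect and recall = correct / size. The sketch below recomputes the Catalan row from the table; the function name is illustrative.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to two decimals.

    size    -- number of test tweets whose true label is this language
    detect  -- number of tweets the detector assigned to this language
    correct -- number of tweets assigned to this language correctly
    """
    precision = round(100.0 * correct / detect, 2)
    recall = round(100.0 * correct / size, 2)
    return precision, recall


print(precision_recall(size=5093, detect=4923, correct=4857))  # Catalan row
```

Because "detect" counts tweets assigned to a language, it can exceed "size" for a single language (as in the German row) even though the totals cannot.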

Estimation for LIGA dataset

• Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model

language       size  detect  correct  precision  recall
de  German     1479    1476     1469       99.5    99.3
en  English    1505    1502     1490       99.2    99.0
es  Spanish    1562    1548     1541       99.6    98.7
fr  French     1551    1549     1540       99.4    99.3
it  Italian    1539    1531     1528       99.8    99.3
nl  Dutch      1430    1429     1424       99.7    99.6
total          9066       -     8992          -    99.2

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports ("-" = not supported)

                        ldig            langdetect      CLD
language         size   correct  rate   correct  rate   correct  rate
bg  Bulgarian    1000         -     -       988  98.8       991  99.1
cs  Czech        1000      1000 100.0       994  99.4       995  99.5
da  Danish       1000       976  97.6       968  96.8       932  93.2
de  German       1000       999  99.9       998  99.8      1000 100.0
el  Greek        1000         -     -      1000 100.0      1000 100.0
en  English      1000       999  99.9       996  99.6      1000 100.0
es  Spanish      1000      1000 100.0       996  99.6       989  98.9
et  Estonian     1000         -     -       996  99.6       998  99.8
fi  Finnish      1000       997  99.7       998  99.8      1000 100.0
fr  French       1000       999  99.9       999  99.9       992  99.2
hu  Hungarian    1000      1000 100.0       999  99.9       999  99.9
it  Italian      1000       999  99.9       999  99.9       996  99.6
lt  Lithuanian   1000         -     -       997  99.7       999  99.9
lv  Latvian      1000         -     -       999  99.9       998  99.8
nl  Dutch        1000      1000 100.0       974  97.4       995  99.5
pl  Polish       1000       998  99.8       999  99.9       997  99.7
pt  Portuguese   1000       995  99.5       996  99.6       989  98.9
ro  Romanian     1000      1000 100.0       999  99.9       998  99.8
sk  Slovak       1000         -     -       988  98.8       990  99.0
sl  Slovene      1000         -     -       976  97.6       963  96.3
sv  Swedish      1000       995  99.5       991  99.1       993  99.3
total           21000     13957  99.7     20850  99.3     20814  99.1

Conclusions

• A language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular between da and no)
• If metadata is added to the features, the precision will improve further
  – Open question: how to add and train metadata at low cost
• We want to shrink the model without loss of precision
  – It is too large for applications (> 100MB)

References

• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 7: Short Text Language Detection with Infinity-Gram

What's Language Detection?

• To detect what language the input text is written in
  – Time flies like an arrow → English
  – Buona sera → Italian
• It is a prerequisite for many language processing tasks
  – A language model is built for each language
  – Text search, classification, extraction, translation
• For long enough and noiseless text, detection with more than 99% accuracy is possible [Cavnar+ 94]
  – A 3-gram model is used in many methods

SPAM or not?

• To decide, it is necessary to know that the text is written in Polish

Document Categorization with Naive Bayes Classifier

• Categorize a document $X = (X_i)$ into a category $C_k$
  – A document $X$ is represented as a collection of words $X_i$ (bag-of-words)
• Word probabilities are assumed to be conditionally independent given the category
  – $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)
  – where $p(X_i \mid C_k)$ is the relative frequency of the word in the category
• Estimate the category $C_k$ that maximizes the posterior
  – $p(C_k \mid X) = \frac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$
  – where $p(C_k)$ is the prior for the category
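The posterior maximization above can be sketched in a few lines. This toy version works in log space and adds add-one smoothing so that unseen words do not zero out the product; the smoothing and the names `train` and `classify` are assumptions for illustration, not details taken from the slides or from langdetect.

```python
import math
from collections import Counter


def train(docs_by_category):
    """docs_by_category maps a category to a list of documents (token lists)."""
    total_docs = sum(len(docs) for docs in docs_by_category.values())
    vocab = {t for docs in docs_by_category.values() for doc in docs for t in doc}
    model = {}
    for category, docs in docs_by_category.items():
        counts = Counter(t for doc in docs for t in doc)
        log_prior = math.log(len(docs) / total_docs)          # log p(C_k)
        model[category] = (log_prior, counts, sum(counts.values()))
    return model, vocab


def classify(model, vocab, tokens):
    """Return argmax_k of log p(C_k) + sum_i log p(X_i | C_k)."""
    def score(category):
        log_prior, counts, total = model[category]
        return log_prior + sum(
            math.log((counts[t] + 1) / (total + len(vocab)))  # add-one smoothing
            for t in tokens
        )
    return max(model, key=score)


docs = {"en": [["the", "cat"], ["the", "dog"]],
        "it": [["il", "gatto"], ["il", "cane"]]}
model, vocab = train(docs)
print(classify(model, vocab, ["the", "gatto", "the"]))
```

The evidence $p(X)$ is dropped because it is the same for every category and does not change the argmax.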

Language Detection with Naive Bayes Classifier

• Document categorization with language labels
  – Categorize documents into English, Japanese and so on
• Use character n-grams as features
  – Unicode code point n-grams, strictly speaking
  – Assume the character encoding of the document is already known
    • Most applications know the encoding of their internal text data

Why Use n-Gram to Detect Language?

• Each language has its own characters and spelling rules
  – “é” is often used in Spanish, Italian and so on, but in principle not in English
  – There are many words which start with “Z” in German, but not in English
  – There are many words which start with “C” in English, but not in German
  – The spelling “Th” is often used in English, but not in the other languages

n-grams of "This" (_ marks a word boundary)
1-gram: T h i s
2-gram: _T Th hi is s_
3-gram: _Th Thi his is_

           C      L      Z      Th
English   0.75   0.47   0.02   0.74
German    0.10   0.37   0.53   0.03
French    0.38   0.69   0.01   0.01
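The n-gram extraction for "This" can be sketched as below. Padding with "_" at word boundaries for n ≥ 2 is an assumption made to reproduce the five 2-grams and four 3-grams of the example; `char_ngrams` is an illustrative name.

```python
def char_ngrams(word, n, pad="_"):
    """Character n-grams of a word; for n >= 2 the word is padded on both sides."""
    if n == 1:
        return list(word)
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```

The boundary marker lets features such as "_Th" capture that a spelling occurs at the start of a word, which is what the "Z" and "C" examples above rely on.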

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr

Estimation with News Text

• Test on news text crawled from the web, in 49 languages

Language                     size  correct (accuracy)
af     Afrikaans              200  199 (99.50%)
ar     Arabic                 200  200 (100.00%)
bg     Bulgarian              200  200 (100.00%)
bn     Bengali                200  200 (100.00%)
cs     Czech                  200  200 (100.00%)
da     Danish                 200  179 (89.50%)
de     German                 200  200 (100.00%)
el     Greek                  200  200 (100.00%)
en     English                200  200 (100.00%)
es     Spanish                200  200 (100.00%)
fa     Persian                200  200 (100.00%)
fi     Finnish                200  200 (100.00%)
fr     French                 200  200 (100.00%)
gu     Gujarati               200  200 (100.00%)
he     Hebrew                 200  200 (100.00%)
hi     Hindi                  200  200 (100.00%)
hr     Croatian               200  200 (100.00%)
hu     Hungarian              200  200 (100.00%)
id     Indonesian             200  200 (100.00%)
it     Italian                200  200 (100.00%)
ja     Japanese               200  200 (100.00%)
kn     Kannada                200  200 (100.00%)
ko     Korean                 200  200 (100.00%)
mk     Macedonian             200  200 (100.00%)
ml     Malayalam              200  200 (100.00%)
mr     Marathi                200  200 (100.00%)
ne     Nepali                 200  200 (100.00%)
nl     Dutch                  200  200 (100.00%)
no     Norwegian              200  199 (99.50%)
pa     Punjabi                200  200 (100.00%)
pl     Polish                 200  200 (100.00%)
pt     Portuguese             200  200 (100.00%)
ro     Romanian               200  200 (100.00%)
ru     Russian                200  200 (100.00%)
sk     Slovak                 200  200 (100.00%)
so     Somali                 200  200 (100.00%)
sq     Albanian               200  200 (100.00%)
sv     Swedish                200  200 (100.00%)
sw     Swahili                200  200 (100.00%)
ta     Tamil                  200  200 (100.00%)
te     Telugu                 200  200 (100.00%)
th     Thai                   200  200 (100.00%)
tl     Tagalog                200  200 (100.00%)
tr     Turkish                200  200 (100.00%)
uk     Ukrainian              200  200 (100.00%)
ur     Urdu                   200  200 (100.00%)
vi     Vietnamese             200  200 (100.00%)
zh-cn  Simplified Chinese     200  200 (100.00%)
zh-tw  Traditional Chinese    200  200 (100.00%)
total                        9800  9777 (99.77%)

Estimation with Europarl datasets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 16

bull Test for 1000 samples for each

language from Europarl Parallel Corpus

ndash from the proceedings of the European Parliament

ndash httpwwwstatmtorgeuroparl

bull httpcodegooglecomplanguage-

detectiondownloadsdetailname=eur

oparl-testzip

language size correct accuracybg Bulgarian 1000 988 988cs Czech 1000 994 994da Dannish 1000 968 968de German 1000 998 998el Greek 1000 1000 1000en English 1000 996 996es Spanish 1000 996 996et Estonian 1000 996 996fi Finnish 1000 998 998fr French 1000 999 999hu Hungarian 1000 999 999it Italian 1000 999 999lt Lithuanian 1000 997 997lv Latvian 1000 999 999nl Dutch 1000 974 974pl Polish 1000 999 999pt Portuguese 1000 996 996ro Romanian 1000 999 999sk Slovak 1000 988 988sl Slovene 1000 976 976sv Swedish 1000 991 991

total 21000 20850 993

Language Detection has been over isnt it

17

We still have ENEMY to beat

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 18

Twitter Language Detection with the Existing Methods

bull Only 90-95 accuracy

for tweet corpus

bull LD = language-detection

bull CLD = Chromium Compact Language

Detection

ndash httpcodegooglecompchromium-

compact-language-detector

ndash regard ms(Malay) as id(Indonesian)

bull Tika = Apache Tika

ndash httptikaapacheorg

ndash Estimate on 15 languages which Tika

supports in our tweet corpus

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 19

language LD CLD Tikaca Catalan 953 930 838cs Czech 963 966 ----da Dannish 945 907 587de German 866 968 731en English 883 974 547es Spanish 915 905 444fi Finnish 989 994 948fr French 950 945 674hu Hungarian 858 890 762id Indonesian 897 928 ----it Italian 962 938 871nl Dutch 695 932 650no Norwegian 960 749 686pl Polish 980 978 888pt Portuguese 880 886 474ro Romanian 928 961 826sv Swedish 960 964 756tr Turkish 976 974 ----vi Vietnamese 987 989 ----

total 922 938 700

Chromium Compact Language Detection (CLD)

bull Porting the language detector from

Google Chromium ndash httpcodegooglecompchromium-compact-language-detector

ndash Implementation in C++ Python binding

ndash of supported languages CLD = 76

langdetect = 53

ndash Accuracy CLD = 9882 langdetect =

9922

bull for 17 languages on Europarl datasets bull httpblogmikemccandlesscom201110accuracy-and-performance-of-googleshtml

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 20

Is twitter Language Detection difficult (1)

bull Tweet is too short to extract 3-gram features

ndash At most 140 characters on twitter

ndash URLs mentions and hashtags are not useful to

detect

bull LIGA [Tromp+ 11]

ndash Graph-features based on 3-gram

bull Add long distance features

bull 95~98 accuracy for twitter Language Detection

bull 6 languages (de en es fr it nl)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 21

Is twitter Language Detection difficult (2)

bull Tweet is too noisy

ndash Representations against the languages orthography often appear

ndash Acronym Abbreviation lengthened word (like Cooooolll)

bull Likelihood of tweet tends to get smaller on normal language model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

OMG Oh My God

LOL Laughing Out Loud

LMAO Laughing My Ass Out

F4F Follow for Follow

MDR Mort de Rire (French)

TKT Ne tlsquoInquiegravete Pas (Fr)

u you

ur your

4 for

i0u I love you

k che (Italian)

anke anche(Italian)

Letter k isnt used in Italian

22

Motivation to Detect Short Text Language

bull There are many small chunks of text in addition to twitter

ndash Schedule search query bulletin board and so on

ndash There are many questions about short text detection in the Issues Board of langdetect Project

bull httpcodegooglecomplanguage-detectionissuesdetailid=10

bull Detection for multi-language mixed text

ndash Cut the target document in paragraphs or lines

ndash Detect for each short text

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 23

Our Goal

bull Over 99 accuracy

ndash However it is too difficult to detect one

word sentence

ndash Our Goal is 99+ accurate detection for

sentence with more than 3 words

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 24

We need

bull Rich feature extractable model from

short text

ndash Maximal substring model

(infin-gram Logistic Regression)

bull and twitter-specific Language model

or Corpus to construct it

ndash about 700K tweet corpus with language

label

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 25

Proposal Method

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 26

How to Increase Features beyond 3-grams

• The larger n is, the more features there are

• The maximum is at n = ∞, that is, all substrings

– But the number of substrings is O(T^2)

Cumulative number of distinct n-grams by feature length, for 5090 normalized English tweets (300KB):

n    freq≥1    freq≥2    freq≥10
1        79        72         57
2      1896      1533        902
3     15970     10369       4525
4     64966     33941      10534
5    167543     69719      15538
6    323749    107861      18970
7    524634    142954      21093
8    760719    171995      22159
9    921361    193995      22696
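A table like the one above can be reproduced with a few lines of Python. This is only a sketch: the function name `cumulative_ngram_counts` is made up here, and it assumes the input is a list of already normalized strings. It is not the script used for the slide.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """Return {n: number of distinct 1..n-grams whose frequency >= min_freq}."""
    freq = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            # Slide a window of width n over the text.
            for i in range(len(text) - n + 1):
                freq[text[i:i + n]] += 1
    return {
        n: sum(1 for gram, c in freq.items() if len(gram) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    }
```

The counter holds every substring up to length `max_n`, which is exactly why the approach does not scale to n = ∞ without the maximal substring trick.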

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features

– Using only maximal substrings gives an equivalent model that can be constructed in linear time

– Store the features in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (a partial order) among non-empty substrings, for example in "abracadabra"

– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"

– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

Strictly speaking, the relation is defined together with the position inside the containing substring.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring

• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
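The "abracadabra" example can be checked by brute force. The sketch below uses an equivalent test: a substring is maximal when no one-character extension to the left or right keeps its frequency. The function names are illustrative, and the quadratic enumeration is only for small inputs; it is not the linear-time method used in practice.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def maximal_substrings(text):
    """Return substrings that lose occurrences under every one-character extension."""
    alphabet = set(text)
    result = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            sub = text[i:j]
            if sub in result:
                continue
            freq = count_occurrences(text, sub)
            # If some extension occurs equally often, sub is contained in it.
            extendable = any(
                count_occurrences(text, c + sub) == freq
                or count_occurrences(text, sub + c) == freq
                for c in alphabet
            )
            if not extendable:
                result.add(sub)
    return sorted(result, key=len)


print(maximal_substrings("abracadabra"))  # ['a', 'abra', 'abracadabra']
```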

Maximal Substring and Infinity-Gram

• Substrings in the same equivalence class under containment always have equal frequencies

• In a model that is a linear combination of features, features with common values can be merged into one

• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams

Although the equivalence does not hold on the test set, we assume that it is approximated well by a sufficiently large training set.
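The merging step can be written out as follows. The notation is introduced here for illustration and is not from the slides: $E$ is one equivalence class with maximal substring $m$, $x_s(d)$ is the frequency of substring $s$ in training document $d$, and $w_s$ is its weight for some language.

```latex
\sum_{s \in E} w_s \, x_s(d)
  = \Bigl(\sum_{s \in E} w_s\Bigr) x_m(d)
  \qquad \text{because } x_s(d) = x_m(d) \text{ for all } s \in E .
```

So a single weight $w'_m = \sum_{s \in E} w_s$ on the maximal substring gives the same linear score on the training data as separate weights on every substring in the class.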

Extended Suffix Array

• An extended suffix array consists of

– SA = suffix array

– L = longest common prefixes

– B = Burrows-Wheeler transformed text

• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix range with L > 0 whose BWT contains more than 1 kind of character

– These can be computed in linear time

• esaxx: Okanohara's implementation of ESA

– http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character types to detect

– In short text detection, mixed text can be divided by character type

• Languages written in the Latin alphabet

– The most difficult character type to detect

– There are more than 25 such languages with over 5 million speakers

What Is the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet

– å ą æ ð Ħ ŋ and so on

• They are assigned to 9 code blocks in Unicode

Range          Name                          Notes
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these
U+0100-017F    Latin Extended-A              (as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   For composing tone marks
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day language
U+A720-A7FF    Latin Extended-D              (as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code chart of Latin letters. Legend: used only for Vietnamese / often used / sometimes used]

How to Create the Corpus

• Collect tweets with the sample method of the Twitter Streaming API

– It samples 1% of all tweets (about 2 million tweets)

– Tweets in Latin-alphabet languages account for 60% of them

• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone

– French tweets account for only about 1% of all tweets

– But they account for 50% of the tweets in the Paris timezone

(20% of all tweets have no timezone)

• Annotate tweets with tentative labels using langdetect

– Remove non-French tweets from those labeled 'fr'

– Recover French tweets from those not labeled 'fr'

How to Annotate

[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]

Created Corpus

• Noiseless tweets as training data

• Noisy tweets with more than 3 words as test data

• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

code  language      training     test
ca    Catalan           9089     5082
cs    Czech             9082     7682
da    Danish            7388     5524
de    German           44448    10065
en    English          44520    10168
es    Spanish          44118    10265
fi    Finnish           8087     7050
fr    French           44339    10098
hu    Hungarian        10030     4904
id    Indonesian       44722    10181
it    Italian          43366    10152
nl    Dutch            44682    10007
no    Norwegian        10124     8496
pl    Polish           16771    10152
pt    Portuguese       44215    10208
ro    Romanian         10021     5911
sv    Swedish          44054    10032
tr    Turkish          44703    10308
vi    Vietnamese       15030    10488
      total           538789   166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the Twitter corpus

– But it still reaches at most 98% accuracy

• We guess it is necessary to reduce biases

– data size bias

– language-specific bias

– Twitter-specific bias

Bias by Data Size

• The number of tweets differs hugely between languages

• Level them out by sampling with replacement from each language up to the size of the largest one

– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
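The leveling step can be sketched as below. The function name `level_out` and the dict-of-lists input are assumptions made for this example. It implements the approximation from the slide: whole copies of each language's data plus a remainder sampled without replacement.

```python
import random


def level_out(corpus_by_lang, seed=0):
    """Oversample every language up to the size of the largest one.

    corpus_by_lang maps a language code to a non-empty list of texts.
    """
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus_by_lang.values())
    balanced = {}
    for lang, texts in corpus_by_lang.items():
        copies, rest = divmod(target, len(texts))
        # Integer number of full copies, then the remainder without replacement.
        balanced[lang] = texts * copies + rng.sample(texts, rest)
    return balanced
```

After this step every language contributes the same number of training examples, so the prior no longer favors the languages with the most tweets.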

Convert to Lowercase on Multiple Languages

• Converting to lower case lets a smaller corpus suffice and compresses the model

• But the lower case of I (U+0049) in Turkish differs from that in other languages

• Convert to lower case except for 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
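A minimal sketch of "lower case except for 'I'" follows. The function name is made up for this example and the code is not taken from ldig. Only the ASCII capital I (U+0049) is left untouched; every other character goes through Python's default `str.lower()`.

```python
def lower_except_capital_i(text):
    """Lowercase text but keep 'I' (U+0049), whose lower case depends on the language."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))  # Istanbul Is big
```

Keeping 'I' as-is avoids deciding between i (U+0069) and ı (U+0131) before the language is known.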

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z

• There are 2 kinds of s/t with a "beard"

– U+015E-F, U+0162-3: s/t with cedilla

– U+0218-B: s/t with comma below

• 's/t with cedilla' is more common in news, on Twitter and on Wikipedia

• The 2 kinds have the same glyph in some fonts

– Indistinguishable

comma below    cedilla
ș U+0219       ş U+015F
ț U+021B       ţ U+0163

Romanian Characters on PCs

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently

– 1989: Democratization of Romania

– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode

– 2007: Romania joined the EU

– 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters

– But they are more common than the proper ones

– with cedilla : with comma = 2 : 1

– Even the "Romanian IME" outputs the substitutes :D

• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon this is similar to the relationship between the two glyph shapes of the Japanese character 'SA' (さ).

Arabic Character Normalization (in language-detection)

• Arabic and Persian have a similar problem

• The character 'yeh' in Farsi corresponds to 2 code points

– Wikipedia uses only ی (U+06CC, Farsi yeh)

– News sites use only ي (U+064A, Arabic yeh)

• U+064A is a substitute in Farsi

– The popular Arabic charset CP-1256 has no character mapped to U+06CC

– As 'yeh' is used very often in both languages, almost all Persian text detection fails

• Regard U+06CC as U+064A
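Both substitute-character rules (Romanian s/t with comma below to cedilla, and Farsi yeh to Arabic yeh) are plain one-to-one code point mappings. A sketch with `str.translate` follows. The table name and function name are illustrative only, and this is not the code of langdetect or ldig.

```python
# Map each "proper" code point to the substitute that dominates real text.
SUBSTITUTES = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
    "\u06cc": "\u064a",  # Farsi yeh -> Arabic yeh
})


def normalize_substitutes(text):
    """Fold both spellings of each character onto a single code point."""
    return text.translate(SUBSTITUTES)
```

The direction of the mapping follows the slides: the more frequent form is kept so that most existing training text needs no change.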

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels

– a ă â e ê i y o ô ơ u ư

• Vietnamese has 6 tones

– a ả à ã á ạ

– These tone marks are also used in general documents like news

• A tone mark can be attached to every vowel

– 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones

1. Use U+1EA0 - U+1EF9

• ẵ = U+1EB5

2. Combine with diacritical marks

• ẵ = U+0103 U+0303

– Both are used about half and half in news and tweets

• Normalize 2 into 1
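One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard library. This is a swapped-in technique for illustration, and the function name is made up here. The slides do not say that the original implementation uses NFC, and it may use its own mapping table instead.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining tone marks into precomposed letters (NFC)."""
    return unicodedata.normalize("NFC", text)


# U+0103 (a with breve) + U+0303 (combining tilde) becomes the single code point U+1EB5.
assert compose_vietnamese("\u0103\u0303") == "\u1eb5"
```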

CJK-Kanji Normalization (1) (in language-detection)

• CJK-Kanji has too many characters (more than 20K)

– Other character types have only 30-50 characters

• The character space is very sparse

– Characters that don't occur in the training corpus have no probabilities

• e.g. 谢谢, Kanji for personal names

– Common frequent characters are too strong

• e.g. a text that contains "的" tends to be detected as Traditional Chinese

• Because Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (in language-detection)

• Group Kanji by frequency and normalize each group to a representative character

– (1) K-means clustering

• Use tf-idf on Wikipedia and Google News

• K = 50 (size of the ASCII alphabet = 52)

– (2) "Commonly used Kanji" lists provided in Japan and China

• Simplified Chinese: 现代汉语常用字表 (3500)

• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)

• Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998

– 常用漢字 contains few Kanji for personal names and place names

• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove

– URLs

– mentions

– hashtags

– RT

– emoticons made of letters, like XD and :p
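The removal rules can be sketched with regular expressions. The patterns and the function name `strip_twitter_markup` are assumptions for this example, and the emoticon pattern in particular covers only a few common forms. They are not the expressions used in ldig.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as xD, XDDD, :p, :D, ;P (illustrative subset).
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;=][-^]?[pPdD])(?!\w)")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags, RT and letter emoticons, then tidy spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


print(strip_twitter_markup("RT @bob check http://t.co/abc #fun xD great :p"))  # check great
```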

Normalization for Twitter-Specific Representations

• How to handle words like 'coooooooollllll'?

• Option 1: Make a normalization dictionary using [Brody+ 11]

– Unsupervised normalization, like coooollll → cool

– It can't handle words that are not in the dictionary

• Option 2: If the same character repeats 3 or more times, shrink it to 2

– No language has 3 or more repetitions of the same Latin letter in its orthography

• In Japanese, there are "かたたたき", "かわいいいぬ", "あわててて" and so on

• Acronyms (like WWW, СССР) are not useful for language detection
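Option 2 fits in one regular expression. This sketch assumes that "3 or more" is the intended threshold and uses a made-up function name. It is not the exact expression from ldig.

```python
import re

# A character followed by at least two more copies of itself.
_RUN = re.compile(r"(.)\1{2,}")


def shrink_runs(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return _RUN.sub(r"\1\1", text)


print(shrink_runs("coooooooollllll"))  # cooll
```

The result "cooll" is not a dictionary word, but lengthened and plain spellings now share most of their substrings, which is what the feature extraction needs.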

Laugh Normalization

• There are various laughs in each language

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi :) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha
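A possible regular expression for laugh shrinking is sketched below. The consonant set (h, j, k), the vowel set and the function name are assumptions drawn from the examples on the slide. The actual rule in ldig may differ.

```python
import re

# consonant(s) + vowel(s), followed by one or more repeats of the same pair;
# repeated letters inside the laugh (as in "hahahha") are tolerated.
_LAUGH = re.compile(r"([hjk])+([aeio])+(?:\1+\2+)+", re.IGNORECASE)


def shrink_laugh(text):
    """Shrink laughs such as 'hahahha' or 'jejejeje' to two repetitions."""
    return _LAUGH.sub(r"\1\2\1\2", text)


print(shrink_laugh("hahahha"))  # haha
```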

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages

– https://github.com/shuyo/ldig

• MIT license

• The trained model is also distributed there

– ∞-gram LR (maximal substring) [Okanohara+ 09]

– L1 SGD (cumulative penalty) [Tsuruoka+ 09]

– Double array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]

– Extract features from the corpus and initialize the model

– -m: model directory

– -x: path of the maximal substring extractor (executed as an external process)

– --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]

– Input is multi-line text

• Replace TABs with " " and line feeds with U+0001 in it

– Output lines are "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

– Train the model on the corpus with 1 cycle of SGD

– -e: learning rate of SGD

– -r: coefficient of L1 regularization

– --wr: how often to regularize all parameters

• There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink

– Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]

– Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data

– [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
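A line in this format can be split as sketched below. The function `parse_line` is hypothetical. It assumes that fields are TAB-separated, that the first field is the label, that the last field is the text and that everything in between is metadata. ldig's own parser may behave differently.

```python
def parse_line(line):
    """Split 'label<TAB>meta...<TAB>text' into (label, [meta fields], text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    meta = fields[1:-1]  # empty when the line has no metadata fields
    return label, meta, text
```

For example, `parse_line("en\tim online")` yields the label `"en"`, an empty metadata list and the text `"im online"`.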

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]

– Open http://localhost:[port] after it starts

– For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Evaluation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size".

code  language      size    detect  correct  precision  recall   LD53   LDsm
ca    Catalan       5093      4923     4857      98.66   95.37   95.3   97.0
cs    Czech         7681      7668     7663      99.93   99.77   96.3   99.7
da    Danish        5516      5472     5310      97.04   96.27   94.5   92.4
de    German       10060     10069    10006      99.37   99.46   86.6   93.8
en    English      10162     10133    10029      98.97   98.69   88.3   95.0
es    Spanish      10244     10284    10120      98.41   98.79   91.5   96.0
fi    Finnish       7051      7038     7024      99.80   99.62   98.9   99.6
fr    French       10074     10134    10051      99.18   99.77   95.0   98.1
hu    Hungarian     4904      4892     4858      99.30   99.06   85.8   95.5
id    Indonesian   10178     10225    10160      99.36   99.82   89.7   98.9
it    Italian      10143     10205    10103      99.00   99.61   96.2   98.0
nl    Dutch        10005      9916     9858      99.42   98.53   69.5   97.4
no    Norwegian     8504      8432     8201      97.26   96.44   96.0   96.3
pl    Polish       10151     10149    10130      99.81   99.79   98.0   99.7
pt    Portuguese   10212     10201    10119      99.20   99.09   88.0   96.9
ro    Romanian      5913      5867     5850      99.71   98.93   92.8   97.4
sv    Swedish      10025     10093     9942      98.50   99.17   96.0   97.9
tr    Turkish      10308     10317    10298      99.82   99.90   97.6   99.5
vi    Vietnamese   10487     10480    10474      99.94   99.88   98.7   99.2
      total       166711         -   165053          -   99.01   92.2   97.4
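The columns of the table are related in a simple way, sketched below. Both function names are made up for this example. The 0.6 threshold is the one stated with the table, while the dict of probabilities is an assumed input format.

```python
def detect_or_none(probabilities, threshold=0.6):
    """Return the most probable language, or None when it is below the threshold."""
    lang, prob = max(probabilities.items(), key=lambda item: item[1])
    return lang if prob >= threshold else None


def precision_recall(size, detect, correct):
    """precision = correct / detect, recall = correct / size, both in percent."""
    return 100.0 * correct / detect, 100.0 * correct / size


# Catalan row: size 5093, detect 4923, correct 4857
precision, recall = precision_recall(5093, 4923, 4857)
print(round(precision, 2), round(recall, 2))  # 98.66 95.37
```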

Evaluation on the LIGA Dataset

• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm/

• The 19-language model is used

code  language   size  detect  correct  precision  recall
de    German     1479    1476     1469       99.5    99.3
en    English    1505    1502     1490       99.2    99.0
es    Spanish    1562    1548     1541       99.6    98.7
fr    French     1551    1549     1540       99.4    99.3
it    Italian    1539    1531     1528       99.8    99.3
nl    Dutch      1430    1429     1424       99.7    99.6
      total      9066       -     8992          -    99.2

Evaluation on the Europarl Dataset

ldig results are given only for the languages it supports.

                           ldig            langdetect      CLD
code  language     size    correct  rate   correct  rate   correct  rate
bg    Bulgarian    1000          -     -       988  98.8       991  99.1
cs    Czech        1000       1000 100.0       994  99.4       995  99.5
da    Danish       1000        976  97.6       968  96.8       932  93.2
de    German       1000        999  99.9       998  99.8      1000 100.0
el    Greek        1000          -     -      1000 100.0      1000 100.0
en    English      1000        999  99.9       996  99.6      1000 100.0
es    Spanish      1000       1000 100.0       996  99.6       989  98.9
et    Estonian     1000          -     -       996  99.6       998  99.8
fi    Finnish      1000        997  99.7       998  99.8      1000 100.0
fr    French       1000        999  99.9       999  99.9       992  99.2
hu    Hungarian    1000       1000 100.0       999  99.9       999  99.9
it    Italian      1000        999  99.9       999  99.9       996  99.6
lt    Lithuanian   1000          -     -       997  99.7       999  99.9
lv    Latvian      1000          -     -       999  99.9       998  99.8
nl    Dutch        1000       1000 100.0       974  97.4       995  99.5
pl    Polish       1000        998  99.8       999  99.9       997  99.7
pt    Portuguese   1000        995  99.5       996  99.6       989  98.9
ro    Romanian     1000       1000 100.0       999  99.9       998  99.8
sk    Slovak       1000          -     -       988  98.8       990  99.0
sl    Slovene      1000          -     -       976  97.6       963  96.3
sv    Swedish      1000        995  99.5       991  99.1       993  99.3
      total       21000      13957  99.7     20850  99.3     20814  99.1

Conclusions

• Language detector using the maximal substring model

– It detects 19 languages with over 99% accuracy

– Even langdetect reaches 97% accuracy with the tweet corpus

• If the corpus is maintained, the precision will improve further

– There are still many mistakes (in particular between da and no)

• If metadata is added to the features, the precision will improve further

– How can metadata be added and trained at low cost?

• We want to shrink the model without loss of precision

– It is too large for applications (> 100 MB)

References

• [中谷 (Nakatani) NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings; in Japanese)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

SPAM or Not?

• To decide, it is necessary to know that the text is written in Polish

Document Categorization with Naive Bayes Classifier

• Categorize a document $X = (X_i)$ into a category $C_k$

– A document $X$ is represented as a collection of words $X_i$ (bag-of-words)

• Word probabilities are assumed to be conditionally independent given the category

– $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)

– where $p(X_i \mid C_k)$ is the relative frequency of the word in the category

• Estimate the category $C_k$ that maximizes the posterior

– $p(C_k \mid X) = \dfrac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$

– where $p(C_k)$ is the prior of the category

Language Detection with Naive Bayes Classifier

• Document categorization with language labels

– Categorize documents into English, Japanese and so on

• Use character n-grams as features

– Strictly speaking, n-grams of Unicode code points

– Assume the character encoding of the document is already known

• Most applications know the encoding of their internal text data
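A toy version of such a classifier is sketched below. It is not langdetect's implementation: the function names, the single-space padding and the add-one smoothing are all choices made for this example. It computes $\log p(C_k) + \sum_i \log p(X_i \mid C_k)$ over character 3-grams.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Character n-grams of text, padded with a space on both sides."""
    padded = " " + text + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(labeled_texts, n=3):
    """Count n-grams per label; labeled_texts is a list of (label, text) pairs."""
    counts = defaultdict(Counter)
    docs = Counter()
    for label, text in labeled_texts:
        docs[label] += 1
        counts[label].update(char_ngrams(text.lower(), n))
    vocab = {gram for c in counts.values() for gram in c}
    return counts, docs, vocab


def classify(text, model, n=3):
    """Return the label with the highest log posterior (add-one smoothing)."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    best, best_score = None, -math.inf
    for label in counts:
        total = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)  # log prior
        for gram in char_ngrams(text.lower(), n):
            score += math.log((counts[label][gram] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

With two training sentences, one English and one German, `classify` already separates short inputs such as "the thing" and "ich bin sicher". A real profile needs far more text per language.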

Why Use n-Grams to Detect Language?

• Each language has its own characters and spelling rules

– "é" is often used in Spanish, Italian and so on, but in principle not in English

– There are many words that start with "Z" in German, but not in English

– There are many words that start with "C" in English, but not in German

– The spelling "Th" is often used in English, but not in the other languages

Character n-grams of "This" ("_" marks a word boundary):

T h i s           ← 1-gram
_T Th hi is s_    ← 2-gram
_Th Thi his is_   ← 3-gram

          C      L      Z      Th
English   0.75   0.47   0.02   0.74
German    0.10   0.37   0.53   0.03
French    0.38   0.69   0.01   0.01

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java

– http://code.google.com/p/language-detection/

– Apache License 2.0

– Character 3-gram + Bayesian filter

– Various normalizations + feature sampling

• Over 99% precision for 53 languages

– Trained on Wikipedia abstracts

– Wide language support, including Asian languages

– Adopted by Apache Solr

Evaluation with News Text

• Test on news text crawled from the web in 49 languages

code   language              size  correct  accuracy
af     Afrikaans              200      199    99.50%
ar     Arabic                 200      200   100.00%
bg     Bulgarian              200      200   100.00%
bn     Bengali                200      200   100.00%
cs     Czech                  200      200   100.00%
da     Danish                 200      179    89.50%
de     German                 200      200   100.00%
el     Greek                  200      200   100.00%
en     English                200      200   100.00%
es     Spanish                200      200   100.00%
fa     Persian                200      200   100.00%
fi     Finnish                200      200   100.00%
fr     French                 200      200   100.00%
gu     Gujarati               200      200   100.00%
he     Hebrew                 200      200   100.00%
hi     Hindi                  200      200   100.00%
hr     Croatian               200      200   100.00%
hu     Hungarian              200      200   100.00%
id     Indonesian             200      200   100.00%
it     Italian                200      200   100.00%
ja     Japanese               200      200   100.00%
kn     Kannada                200      200   100.00%
ko     Korean                 200      200   100.00%
mk     Macedonian             200      200   100.00%
ml     Malayalam              200      200   100.00%
mr     Marathi                200      200   100.00%
ne     Nepali                 200      200   100.00%
nl     Dutch                  200      200   100.00%
no     Norwegian              200      199    99.50%
pa     Punjabi                200      200   100.00%
pl     Polish                 200      200   100.00%
pt     Portuguese             200      200   100.00%
ro     Romanian               200      200   100.00%
ru     Russian                200      200   100.00%
sk     Slovak                 200      200   100.00%
so     Somali                 200      200   100.00%
sq     Albanian               200      200   100.00%
sv     Swedish                200      200   100.00%
sw     Swahili                200      200   100.00%
ta     Tamil                  200      200   100.00%
te     Telugu                 200      200   100.00%
th     Thai                   200      200   100.00%
tl     Tagalog                200      200   100.00%
tr     Turkish                200      200   100.00%
uk     Ukrainian              200      200   100.00%
ur     Urdu                   200      200   100.00%
vi     Vietnamese             200      200   100.00%
zh-cn  Simplified Chinese     200      200   100.00%
zh-tw  Traditional Chinese    200      200   100.00%
       total                 9800     9777    99.77%

Evaluation with Europarl Datasets

• Test on 1000 samples for each language from the Europarl Parallel Corpus

– from the proceedings of the European Parliament

– http://www.statmt.org/europarl/

• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

code  language     size  correct  accuracy
bg    Bulgarian    1000      988      98.8
cs    Czech        1000      994      99.4
da    Danish       1000      968      96.8
de    German       1000      998      99.8
el    Greek        1000     1000     100.0
en    English      1000      996      99.6
es    Spanish      1000      996      99.6
et    Estonian     1000      996      99.6
fi    Finnish      1000      998      99.8
fr    French       1000      999      99.9
hu    Hungarian    1000      999      99.9
it    Italian      1000      999      99.9
lt    Lithuanian   1000      997      99.7
lv    Latvian      1000      999      99.9
nl    Dutch        1000      974      97.4
pl    Polish       1000      999      99.9
pt    Portuguese   1000      996      99.6
ro    Romanian     1000      999      99.9
sk    Slovak       1000      988      98.8
sl    Slovene      1000      976      97.6
sv    Swedish      1000      991      99.1
      total       21000    20850      99.3

Language Detection Is Solved, Isn't It?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus

• LD = language-detection

• CLD = Chromium Compact Language Detector

– http://code.google.com/p/chromium-compact-language-detector/

– ms (Malay) is regarded as id (Indonesian)

• Tika = Apache Tika

– http://tika.apache.org/

– Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):

code  language      LD     CLD    Tika
ca    Catalan       95.3   93.0   83.8
cs    Czech         96.3   96.6   ----
da    Danish        94.5   90.7   58.7
de    German        86.6   96.8   73.1
en    English       88.3   97.4   54.7
es    Spanish       91.5   90.5   44.4
fi    Finnish       98.9   99.4   94.8
fr    French        95.0   94.5   67.4
hu    Hungarian     85.8   89.0   76.2
id    Indonesian    89.7   92.8   ----
it    Italian       96.2   93.8   87.1
nl    Dutch         69.5   93.2   65.0
no    Norwegian     96.0   74.9   68.6
pl    Polish        98.0   97.8   88.8
pt    Portuguese    88.0   88.6   47.4
ro    Romanian      92.8   96.1   82.6
sv    Swedish       96.0   96.4   75.6
tr    Turkish       97.6   97.4   ----
vi    Vietnamese    98.7   98.9   ----
      total         92.2   93.8   70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium

– http://code.google.com/p/chromium-compact-language-detector/

– Implemented in C++, with a Python binding

– # of supported languages: CLD = 76, langdetect = 53

– Accuracy: CLD = 98.82%, langdetect = 99.22%

• for 17 languages on Europarl datasets

• http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is Twitter Language Detection Difficult? (1)

• Tweets are too short to extract enough 3-gram features

– At most 140 characters on Twitter

– URLs, mentions and hashtags are not useful for detection

• LIGA [Tromp+ 11]

– Graph-based features built on 3-grams

• Adds long-distance features

• 95~98% accuracy for Twitter language detection

• 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult (2)

bull Tweet is too noisy

ndash Representations against the languages orthography often appear

ndash Acronym Abbreviation lengthened word (like Cooooolll)

bull Likelihood of tweet tends to get smaller on normal language model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

OMG Oh My God

LOL Laughing Out Loud

LMAO Laughing My Ass Out

F4F Follow for Follow

MDR Mort de Rire (French)

TKT Ne tlsquoInquiegravete Pas (Fr)

u you

ur your

4 for

i0u I love you

k che (Italian)

anke anche(Italian)

Letter k isnt used in Italian

22

Motivation to Detect Short Text Language

bull There are many small chunks of text in addition to twitter

ndash Schedule search query bulletin board and so on

ndash There are many questions about short text detection in the Issues Board of langdetect Project

bull httpcodegooglecomplanguage-detectionissuesdetailid=10

bull Detection for multi-language mixed text

ndash Cut the target document in paragraphs or lines

ndash Detect for each short text

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 23

Our Goal

bull Over 99 accuracy

ndash However it is too difficult to detect one

word sentence

ndash Our Goal is 99+ accurate detection for

sentence with more than 3 words

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 24

We need

bull Rich feature extractable model from

short text

ndash Maximal substring model

(infin-gram Logistic Regression)

bull and twitter-specific Language model

or Corpus to construct it

ndash about 700K tweet corpus with language

label

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 25

Proposal Method

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 26

How to increase features from 3-grams

bull The more n the more features

bull Maximum at n=infin that is all substring

ndash But it has O(T2) order

gram of n-gram

freq≧1 freq≧2 freq≧10

1 79 72 57

2 1896 1533 902

3 15970 10369 4525

4 64966 33941 10534

5 167543 69719 15538

6 323749 107861 18970

7 524634 142954 21093

8 760719 171995 22159

9 921361 193995 22696

cumulative distributuion of feature length for 5090 normalized English tweets (300KB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 27

Text Categorization with All Substring Features [Okanohara+ 09]

bull Multiclass Logistic Regression using all

substrings as features

ndash Maximal Substring makes the equivalent

model that can be constructed in linear

time

ndash Store features into TRIE fast prediction

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 28

Maximal Substring (1)

bull Define a containment(semi-order)

among non empty substrings

abracadabra

ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur

as the substring of ldquobrardquo

ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo

but also ldquocardquo It is strictly defined with also its position in the substring

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 29

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create the Corpus

• Collect tweets with the sample method of the Twitter Streaming API

– It samples 1% of all tweets (about 2 million tweets)

– Tweets in Latin-alphabet languages account for 60% of them

• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone

– French tweets account for only about 1% of all tweets

– But they account for 50% of the tweets in the Paris timezone

(20% of all tweets have no timezone)

• Annotate tentative labels to tweets using langdetect

– Remove non-French tweets from those labeled ‘fr’

– Recover French tweets from those not labeled ‘fr’

How to Annotate

[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]

Created Corpus

• Noiseless tweets as training data

• Noisy tweets with more than 3 words as test data

• Catalan corpus created together with Raúl Velaz and Hiroshi Manabe

language          training    test
ca Catalan            9089    5082
cs Czech              9082    7682
da Danish             7388    5524
de German            44448   10065
en English           44520   10168
es Spanish           44118   10265
fi Finnish            8087    7050
fr French            44339   10098
hu Hungarian         10030    4904
id Indonesian        44722   10181
it Italian           43366   10152
nl Dutch             44682   10007
no Norwegian         10124    8496
pl Polish            16771   10152
pt Portuguese        44215   10208
ro Romanian          10021    5911
sv Swedish           44054   10032
tr Turkish           44703   10308
vi Vietnamese        15030   10488
total               538789  166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the Twitter corpus

– But it still reaches at most 98% accuracy

• We guess it is necessary to reduce bias

– data size bias

– language-specific bias

– Twitter-specific bias

Bias by Data Size

• The number of tweets per language is heavily skewed

• Level them out by sampling with replacement from each language up to the size of the largest one

– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
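The approximation described above (whole copies plus a remainder sampled without replacement) could be sketched as follows. The function name `level_out`, the dictionary layout and the fixed seed are assumptions for illustration, not ldig's actual code, and every language is assumed to have at least one tweet.

```python
import random


def level_out(corpus_by_lang, seed=0):
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus_by_lang.values())
    leveled = {}
    for lang, tweets in corpus_by_lang.items():
        # whole copies of the data, then the remainder without replacement
        copies, rest = divmod(target, len(tweets))
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_out(corpus)
```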

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the amount of corpus needed and compresses the model

• But the lower case of I (U+0049) in Turkish differs from other languages

• Convert to lower case, excluding ‘I’

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
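A minimal sketch of lower-casing that leaves ‘I’ alone. The slide only says to exclude ‘I’; also leaving İ (U+0130) untouched is an assumption made here, because Python's `str.lower()` would expand it into two code points. The name `lower_except_i` is hypothetical.

```python
def lower_except_i(text):
    """Lower-case text but keep I (U+0049) and İ (U+0130) as they are."""
    keep = ("I", "\u0130")
    return "".join(ch if ch in keep else ch.lower() for ch in text)
```

With this, "ISTANBUL Is" becomes "Istanbul Is": the language-dependent letter stays intact and the model can learn from its context.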

Normalization for Romanian

• Romanian uses â, ă, î, ș, ț in addition to a-z

• There are two kinds of s/t with a “beard”

– U+015E-F, U+0162-3: s/t with cedilla

– U+0218-B: s/t with comma below

• ‘s/t with cedilla’ is more popular in news, Twitter and Wikipedia

• The two kinds have the same design in some fonts

– Indistinguishable

ș (U+0219) / ş (U+015F)

ț (U+021B) / ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires ‘s/t with comma’, they were not available on PCs until recently

– 1989: Democratization of Romania

– 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode

– 2007: Romania joined the EU

– 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)

[Photo: ‘s/t with cedilla’ used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters

– But they are more popular than the proper ones

– with cedilla : with comma = 2 : 1

– The “Romanian IME” outputs the substitutes too :D

• Regard ‘s/t with comma’ as ‘s/t with cedilla’

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the glyph variants of the Japanese character ‘SA’ (さ)
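The mapping above can be sketched with a translation table. The names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical; only the four code point pairs come from the slides.

```python
# s/t with comma below -> s/t with cedilla (upper and lower case)
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    """Fold the comma-below forms into the more common cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)
```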

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem

• The character ‘yeh’ in Farsi corresponds to 2 code points

– Wikipedia uses only ی (U+06CC, Farsi yeh)

– News uses only ي (U+064A, Arabic yeh)

• U+064A is a substitute in Farsi

– The popular Arabic charset CP-1256 has no character mapped to U+06CC

– As ‘yeh’ is used very often in both languages, almost all Persian text detection fails

• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels

– a ă â e ê i y o ô ơ u ư

• Vietnamese has 6 tones

– a ả à ã á ạ

– These tone symbols are also used in general documents like news

• The tone symbols can be appended to all vowels

– 12 × 6 = 72

Normalization for Vietnamese (2)

• Representations of vowels with tones

1. Use U+1EA0 - U+1EF9

· ẵ = U+1EB5

2. Combine with diacritical marks

· ẵ = U+0103 U+0303

– Half and half in news and tweets

• Normalize 2 into 1
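One way to fold representation 2 into representation 1 is Unicode NFC normalization from Python's standard library. This is a substitute technique shown for illustration; the slides do not say how the normalization is actually implemented, and the name `normalize_vietnamese` is hypothetical.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base vowel + combining tone mark into the precomposed letter."""
    return unicodedata.normalize("NFC", text)


composed = normalize_vietnamese("\u0103\u0303")  # ă followed by combining tilde
```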

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji have too many characters (more than 20K)

– Other character types have only 30-50 characters

• The character space is very sparse

– Characters that don't occur in the training corpus have no probabilities

· e.g. 谢谢, Kanji for personal names

– Common frequent characters are too strong

· e.g. a text that contains “的” tends to be detected as Traditional Chinese

• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character

– (1) K-means clustering

· Use tf-idf on Wikipedia and Google News

· K=50 (size of ASCII alphabet = 52)

– (2) “Commonly Used Kanji” lists provided in Japan and China

· Simplified Chinese: 现代汉语常用字表 (3500)

· Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)

· Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998

– 常用漢字 contains few Kanji for personal names and place names

• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove

– URLs

– mentions

– hashtags

– RT

– emoticons made of letters, like XD and :p
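A rough sketch of this removal step with regular expressions. The patterns and the name `strip_twitter_markup` are assumptions for illustration and are simpler than what a production tokenizer would need; emoticons are not handled here.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags and the RT marker, then tidy spaces."""
    for pattern in (URL, MENTION, HASHTAG, RT):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```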

Normalization for Twitter-Specific Representation

• What to do with words like ‘coooooooollllll’?

• Case 1: Make a normalization dictionary using [Brody+ 2011]

– Unsupervised normalization like coooollll → cool

– It can't handle words that are not in the dictionary

• Case 2: If the same character repeats 3 or more times, shrink it to 2

– No language repeats the same Latin letter 3 or more times in its orthography

· In Japanese, however, there are “かたたたき”, “かわいいいぬ”, “あわてててて” and so on

· Acronyms (like WWW, СССР) are not useful for language detection
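Case 2 can be sketched with a single back-referencing regular expression. The name `shrink_repeats` is hypothetical, and the threshold (3 or more repeats become 2) is one reading of the rule on the slide.

```python
import re

# one character followed by two or more copies of itself
REPEAT = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to 2."""
    return REPEAT.sub(r"\1\1", text)
```

The result for ‘coooooooollllll’ is ‘cooll’ rather than ‘cool’, which is fine for feature extraction as long as training and test data are normalized the same way.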

Laugh Normalization

• There are various laughs in each language

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi :) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha
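A minimal sketch of the shrinking step for regular laughs such as "jejejeje". It treats any two-character unit repeated 3 or more times as a laugh, which is an assumption; irregular spellings like "hahahha" would need extra handling that is not shown here. The name `shrink_laugh` is hypothetical.

```python
import re

# a two-character unit followed by two or more copies of itself
LAUGH = re.compile(r"(\w\w)\1{2,}")


def shrink_laugh(text):
    """Shrink a two-character unit repeated 3+ times to two repetitions."""
    return LAUGH.sub(r"\1\1", text)
```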

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages

– https://github.com/shuyo/ldig

· MIT license

· The trained model is also distributed there

– ∞-gram LR (maximal substring) [Okanohara+ 09]

– L1 SGD (cumulative penalty) [Tsuruoka+ 09]

– Double array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]

– Extract features from the corpus and initialize the model

– -m: model directory

– -x: path of the maximal substring extractor (executed as an external process)

– --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]

– Input is multi-line text

· In the input, TABs are replaced with “ ” and line feeds with U+0001

– Output as “[feature]\t[frequency]”

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

– Train the model on the corpus for 1 cycle of SGD

– -e: learning rate of SGD

– -r: coefficient of L1 regularization

– --wr: how often to regularize all parameters

· There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink

– Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]

– Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data

– [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
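A sketch of reading one line in this format. It assumes the label is the first tab-separated field, the text is the last one, and everything in between is metadata (possibly several fields, as the Catalan example suggests). The name `parse_line` and the sample values are hypothetical.

```python
def parse_line(line):
    """Split '[label]\\t[meta data]\\t[text]' into (label, metadata, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # zero or more metadata fields
    return label, metadata, text


label, metadata, text = parse_line("en\t12345\tu should just enjoy ur vacation\n")
```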

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]

– Open http://localhost:[port]/ after it starts

– For text entered in the text area, it outputs the language probabilities, the contained features and their parameters

Evaluation

LD53 = langdetect + standard bundled profiles. LDsm = langdetect + profiles based on the Twitter corpus.

As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of “detect” is less than the sum of “size”.

language          size    detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan        5093    4923    4857     98.66         95.37      95.3     97.0
cs Czech          7681    7668    7663     99.93         99.77      96.3     99.7
da Danish         5516    5472    5310     97.04         96.27      94.5     92.4
de German         10060   10069   10006    99.37         99.46      86.6     93.8
en English        10162   10133   10029    98.97         98.69      88.3     95.0
es Spanish        10244   10284   10120    98.41         98.79      91.5     96.0
fi Finnish        7051    7038    7024     99.80         99.62      98.9     99.6
fr French         10074   10134   10051    99.18         99.77      95.0     98.1
hu Hungarian      4904    4892    4858     99.30         99.06      85.8     95.5
id Indonesian     10178   10225   10160    99.36         99.82      89.7     98.9
it Italian        10143   10205   10103    99.00         99.61      96.2     98.0
nl Dutch          10005   9916    9858     99.42         98.53      69.5     97.4
no Norwegian      8504    8432    8201     97.26         96.44      96.0     96.3
pl Polish         10151   10149   10130    99.81         99.79      98.0     99.7
pt Portuguese     10212   10201   10119    99.20         99.09      88.0     96.9
ro Romanian       5913    5867    5850     99.71         98.93      92.8     97.4
sv Swedish        10025   10093   9942     98.50         99.17      96.0     97.9
tr Turkish        10308   10317   10298    99.82         99.90      97.6     99.5
vi Vietnamese     10487   10480   10474    99.94         99.88      98.7     99.2
total             166711          165053                 99.01      92.2     97.4
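The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. The slides do not state these formulas, so this is an inference from the numbers; the sketch below reproduces the Catalan row, and the name `precision_recall` is hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent from the table's count columns."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return precision, recall


p, r = precision_recall(size=5093, detect=4923, correct=4857)
```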

Evaluation on the LIGA Dataset

• Evaluate using the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm/

(Using the 19-language model)

language      size   detect  correct  precision(%)  recall(%)
de German     1479   1476    1469     99.5          99.3
en English    1505   1502    1490     99.2          99.0
es Spanish    1562   1548    1541     99.6          98.7
fr French     1551   1549    1540     99.4          99.3
it Italian    1539   1531    1528     99.8          99.3
nl Dutch      1430   1429    1424     99.7          99.6
total         9066           8992                   99.2

Evaluation on the Europarl Dataset

(ldig figures cover only the languages that ldig supports)

                         ldig               langdetect         CLD
language         size    correct  rate(%)   correct  rate(%)   correct  rate(%)
bg Bulgarian     1000    -        -         988      98.8      991      99.1
cs Czech         1000    1000     100.0     994      99.4      995      99.5
da Danish        1000    976      97.6      968      96.8      932      93.2
de German        1000    999      99.9      998      99.8      1000     100.0
el Greek         1000    -        -         1000     100.0     1000     100.0
en English       1000    999      99.9      996      99.6      1000     100.0
es Spanish       1000    1000     100.0     996      99.6      989      98.9
et Estonian      1000    -        -         996      99.6      998      99.8
fi Finnish       1000    997      99.7      998      99.8      1000     100.0
fr French        1000    999      99.9      999      99.9      992      99.2
hu Hungarian     1000    1000     100.0     999      99.9      999      99.9
it Italian       1000    999      99.9      999      99.9      996      99.6
lt Lithuanian    1000    -        -         997      99.7      999      99.9
lv Latvian       1000    -        -         999      99.9      998      99.8
nl Dutch         1000    1000     100.0     974      97.4      995      99.5
pl Polish        1000    998      99.8      999      99.9      997      99.7
pt Portuguese    1000    995      99.5      996      99.6      989      98.9
ro Romanian      1000    1000     100.0     999      99.9      998      99.8
sk Slovak        1000    -        -         988      98.8      990      99.0
sl Slovene       1000    -        -         976      97.6      963      96.3
sv Swedish       1000    995      99.5      991      99.1      993      99.3
total            21000   13957    99.7      20850    99.3      20814    99.1

Conclusions

• Language detector using the maximal substring model

– Detects 19 languages with over 99% accuracy

– Even langdetect reaches 97% accuracy with the tweet corpus

• If the corpus is cleaned up, precision will improve further

– There are still many mistakes (in particular da and no)

• If metadata is added to the features, precision will improve further

– How to add and train on metadata at low cost?

• We want to shrink the model without loss of precision

– It is too large for applications (> 100 MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


Document Categorization with Naive Bayes Classifier

• Categorize a document $X = (X_i)$ into category $C_k$

– A document $X$ is represented as a collection of words $X_i$ (bag-of-words)

• Word probabilities are assumed to be conditionally independent given the category

– $p(X \mid C_k) = \prod_i p(X_i \mid C_k)$ (from the independence assumption)

– where $p(X_i \mid C_k)$ is the relative frequency of the word in the category

• Estimate the category $C_k$ that maximizes the posterior

– $p(C_k \mid X) = \dfrac{p(X \mid C_k)\, p(C_k)}{p(X)} \propto p(C_k) \prod_i p(X_i \mid C_k)$

– where $p(C_k)$ is the prior for the category
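The posterior above can be sketched in a few lines, working in log space to avoid underflow. Add-one smoothing is an assumption added here so that unseen words do not give zero probability; the slide does not mention smoothing. The names `train` and `classify` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs: iterable of (category, list_of_words)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, words in docs:
        doc_counts[category] += 1
        word_counts[category].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(words, model):
    """Return argmax_k of log p(C_k) + sum_i log p(X_i | C_k)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for category in doc_counts:
        total = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # prior
        for w in words:
            # add-one smoothed word probability (assumption)
            score += math.log((word_counts[category][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = category, score
    return best


model = train([("en", ["the", "cat"]), ("de", ["die", "katze"])])
```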

Language Detection with Naive Bayes Classifier

• Document categorization with language labels

– Categorize documents into English, Japanese and so on

• Use character n-grams as features

– Strictly speaking, Unicode code point n-grams

– Assume the character encoding of the document is already known

· Most applications know the encoding of their internal text data

Why Use n-Grams to Detect Language?

• Each language has its own characters and spelling rules

– “é” is often used in Spanish, Italian and so on, but not in English in principle

– Many words start with “Z” in German, but not in English

– Many words start with “C” in English, but not in German

– The spelling “Th” is often used in English, but not in other languages

n-grams of “This” (“_” marks a word boundary):

T, h, i, s ← 1-gram
_T, Th, hi, is, s_ ← 2-gram
_Th, Thi, his, is_ ← 3-gram

           C      L      Z      Th
English    0.75   0.47   0.02   0.74
German     0.10   0.37   0.53   0.03
French     0.38   0.69   0.01   0.01
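The decomposition of “This” into n-grams can be sketched as follows. Padding with “_” at the word boundary for n ≥ 2 is an assumption made for this example; the function name `char_ngrams` is hypothetical.

```python
def char_ngrams(word, n, pad="_"):
    """Return character n-grams; for n >= 2 the word is padded at both ends."""
    padded = word if n == 1 else pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


unigrams = char_ngrams("This", 1)
bigrams = char_ngrams("This", 2)
trigrams = char_ngrams("This", 3)
```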

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java

– http://code.google.com/p/language-detection/

– Apache License 2.0

– Character 3-gram + Bayesian filter

– Various normalizations + feature sampling

• Over 99% precision for 53 languages

– Trained on Wikipedia abstracts

– Wide language support, including Asian languages

– Adopted by Apache Solr

Evaluation with News Text

• Test on news text crawled from the web in 49 languages

language                     size   correct  accuracy(%)
af Afrikaans                 200    199      99.50
ar Arabic                    200    200      100.00
bg Bulgarian                 200    200      100.00
bn Bengali                   200    200      100.00
cs Czech                     200    200      100.00
da Danish                    200    179      89.50
de German                    200    200      100.00
el Greek                     200    200      100.00
en English                   200    200      100.00
es Spanish                   200    200      100.00
fa Persian                   200    200      100.00
fi Finnish                   200    200      100.00
fr French                    200    200      100.00
gu Gujarati                  200    200      100.00
he Hebrew                    200    200      100.00
hi Hindi                     200    200      100.00
hr Croatian                  200    200      100.00
hu Hungarian                 200    200      100.00
id Indonesian                200    200      100.00
it Italian                   200    200      100.00
ja Japanese                  200    200      100.00
kn Kannada                   200    200      100.00
ko Korean                    200    200      100.00
mk Macedonian                200    200      100.00
ml Malayalam                 200    200      100.00
mr Marathi                   200    200      100.00
ne Nepali                    200    200      100.00
nl Dutch                     200    200      100.00
no Norwegian                 200    199      99.50
pa Punjabi                   200    200      100.00
pl Polish                    200    200      100.00
pt Portuguese                200    200      100.00
ro Romanian                  200    200      100.00
ru Russian                   200    200      100.00
sk Slovak                    200    200      100.00
so Somali                    200    200      100.00
sq Albanian                  200    200      100.00
sv Swedish                   200    200      100.00
sw Swahili                   200    200      100.00
ta Tamil                     200    200      100.00
te Telugu                    200    200      100.00
th Thai                      200    200      100.00
tl Tagalog                   200    200      100.00
tr Turkish                   200    200      100.00
uk Ukrainian                 200    200      100.00
ur Urdu                      200    200      100.00
vi Vietnamese                200    200      100.00
zh-cn Simplified Chinese     200    200      100.00
zh-tw Traditional Chinese    200    200      100.00
total                        9800   9777     99.77

Evaluation with Europarl Datasets

• Test on 1000 samples for each language from the Europarl Parallel Corpus

– from the proceedings of the European Parliament

– http://www.statmt.org/europarl/

• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language         size    correct  accuracy(%)
bg Bulgarian     1000    988      98.8
cs Czech         1000    994      99.4
da Danish        1000    968      96.8
de German        1000    998      99.8
el Greek         1000    1000     100.0
en English       1000    996      99.6
es Spanish       1000    996      99.6
et Estonian      1000    996      99.6
fi Finnish       1000    998      99.8
fr French        1000    999      99.9
hu Hungarian     1000    999      99.9
it Italian       1000    999      99.9
lt Lithuanian    1000    997      99.7
lv Latvian       1000    999      99.9
nl Dutch         1000    974      97.4
pl Polish        1000    999      99.9
pt Portuguese    1000    996      99.6
ro Romanian      1000    999      99.9
sk Slovak        1000    988      98.8
sl Slovene       1000    976      97.6
sv Swedish       1000    991      99.1
total            21000   20850    99.3

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus

• LD = language-detection

• CLD = Chromium Compact Language Detection

– http://code.google.com/p/chromium-compact-language-detector/

– regards ms (Malay) as id (Indonesian)

• Tika = Apache Tika

– http://tika.apache.org/

– Evaluated on the 15 languages in our tweet corpus that Tika supports

language         LD(%)   CLD(%)  Tika(%)
ca Catalan       95.3    93.0    83.8
cs Czech         96.3    96.6    ----
da Danish        94.5    90.7    58.7
de German        86.6    96.8    73.1
en English       88.3    97.4    54.7
es Spanish       91.5    90.5    44.4
fi Finnish       98.9    99.4    94.8
fr French        95.0    94.5    67.4
hu Hungarian     85.8    89.0    76.2
id Indonesian    89.7    92.8    ----
it Italian       96.2    93.8    87.1
nl Dutch         69.5    93.2    65.0
no Norwegian     96.0    74.9    68.6
pl Polish        98.0    97.8    88.8
pt Portuguese    88.0    88.6    47.4
ro Romanian      92.8    96.1    82.6
sv Swedish       96.0    96.4    75.6
tr Turkish       97.6    97.4    ----
vi Vietnamese    98.7    98.9    ----
total            92.2    93.8    70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium

– http://code.google.com/p/chromium-compact-language-detector/

– Implemented in C++, with a Python binding

– Number of supported languages: CLD = 76, langdetect = 53

– Accuracy: CLD = 98.82%, langdetect = 99.22%

· for 17 languages on the Europarl datasets

· http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is Twitter Language Detection Difficult? (1)

• A tweet is too short to extract enough 3-gram features

– At most 140 characters on Twitter

– URLs, mentions and hashtags are not useful for detection

• LIGA [Tromp+ 11]

– Graph features based on 3-grams

· Adds long-distance features

· 95-98% accuracy for Twitter language detection

· 6 languages (de, en, es, fr, it, nl)

Is Twitter Language Detection Difficult? (2)

• Tweets are too noisy

– Expressions that violate the language's orthography often appear

– Acronyms, abbreviations, lengthened words (like Cooooolll)

• The likelihood of a tweet tends to be low under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'inquiète pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)

(The letter k isn't used in Italian)

Motivation to Detect Short Text Language

• There are many small chunks of text besides tweets

– Schedules, search queries, bulletin boards and so on

– There are many questions about short text detection on the issue board of the langdetect project

· http://code.google.com/p/language-detection/issues/detail?id=10

• Detection for mixed multi-language text

– Cut the target document into paragraphs or lines

– Detect the language of each short text

Our Goal

• Over 99% accuracy

– However, it is too difficult to detect the language of a one-word sentence

– Our goal is 99+% accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text

– Maximal substring model (∞-gram logistic regression)

• and a Twitter-specific language model, or a corpus to construct it

– A corpus of about 700K tweets with language labels

Proposed Method

How to Increase Features from 3-grams

• The larger n, the more features

• Maximum at n=∞, that is, all substrings

– But the number of features is O(T²)

Number of n-grams up to length n, by minimum frequency:

n    freq≥1    freq≥2    freq≥10
1    79        72        57
2    1896      1533      902
3    15970     10369     4525
4    64966     33941     10534
5    167543    69719     15538
6    323749    107861    18970
7    524634    142954    21093
8    760719    171995    22159
9    921361    193995    22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
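A table of this shape could be produced by counting distinct substrings of length up to n and applying frequency thresholds, as sketched below. How the slide's figures were actually computed is not stated; the name `cumulative_ngram_counts` and the toy input are assumptions.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freqs=(1, 2, 10)):
    """For each n, count distinct k-grams (k <= n) whose frequency >= each threshold."""
    counts = Counter()
    rows = []
    for n in range(1, max_n + 1):
        # add all n-grams of this length to the running counts
        for text in texts:
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        rows.append((n, *[sum(1 for c in counts.values() if c >= f) for f in min_freqs]))
    return rows


rows = cumulative_ngram_counts(["abab"], max_n=2)
```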

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features

– Maximal substrings give an equivalent model that can be constructed in linear time

– Features are stored in a TRIE for fast prediction

Maximal Substring (1)

• Define a containment relation (partial order) among non-empty substrings

Example text: abracadabra

– “ra” ⊂ “bra” ⇔ every “ra” occurs as a substring of “bra”

– “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”

Strictly speaking, the definition also takes the position within the substring into account.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring

• The maximal substrings of “abracadabra” are “a”, “abra” and “abracadabra”

(via http://d.hatena.ne.jp/nokuno/20120203/1328237067)
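The “abracadabra” example can be checked by brute force. The sketch below treats a substring as maximal when no single character can be added on the left or on the right without reducing its number of occurrences; this is a simplified stand-in for the suffix-array method and only suitable for tiny inputs. The function names are hypothetical.

```python
def occurrences(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def maximal_substrings(text):
    """Return substrings that cannot be extended without losing occurrences."""
    alphabet = set(text)
    substrings = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    result = []
    for s in substrings:
        n = occurrences(text, s)
        extendable = any(
            occurrences(text, c + s) == n or occurrences(text, s + c) == n
            for c in alphabet
        )
        if not extendable:
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))


found = maximal_substrings("abracadabra")
```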

Maximal Substring and Infinity-Gram

• Substrings in the same containment relation always have equal frequencies

• In a model that is a linear combination of features, features with common values can be merged into one

• Logistic regression with maximal substrings is equivalent to the one with infinity-grams

(Although the equivalence breaks down on the test set, we assume it can be approximated given a sufficiently large training set)

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

('s/t with cedilla' is used on an advertisement board in Bucharest)

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two forms of the Japanese character 'SA' (さ さ)
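Regarding 's/t with comma' as 's/t with cedilla' amounts to a fixed character mapping. A minimal Python sketch follows; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical. The same pattern would apply to the Persian/Arabic 'yeh' normalization described on the next slide.

```python
# Map 's/t with comma below' onto the more common 's/t with cedilla'.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})


def normalize_romanian(text: str) -> str:
    """Replace comma-below letters by their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Încântat de cuno\u0219tin\u021B\u0103"))
```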

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses ی (U+06CC, Farsi yeh) only
  – News uses ي (U+064A, Arabic yeh) only
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1
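One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard library. This is an assumption about how the step could be done, not necessarily what ldig does (it may use its own conversion table). The function name `normalize_vietnamese` is hypothetical.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Compose 'base vowel + combining tone mark' into one precomposed code point."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"   # ă (U+0103) + combining tilde (U+0303)
composed = normalize_vietnamese(decomposed)
print([hex(ord(ch)) for ch in composed])
```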

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has "的" tends to be detected as Traditional Chinese
    • Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons written with letters, like XD and :p
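The removals listed above can be sketched with regular expressions. All patterns and the function name `strip_twitter_noise` are illustrative assumptions, and ldig's actual patterns may differ. The emoticon pattern in particular covers only a few forms such as XD and :p.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")
# Emoticons written with letters, e.g. xD, XDDD, :p, ;D (a small assumed subset)
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD])(?!\w)")


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and letter emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RT, FACE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind


print(strip_twitter_noise("RT @foo check http://t.co/abc #nlp xDDD great :p"))
```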

Normalization for twitter-Specific Representation

• How to handle lengthened words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has 3 or more consecutive repetitions of the same Latin letter in its orthography
    • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
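Case 2 fits in a single regular expression with a backreference. This is a minimal sketch with a hypothetical function name `shrink_repeats`, and it reads the threshold as "3 or more".

```python
import re

# One character followed by at least two more copies of itself.
REPEAT = re.compile(r"(.)\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink any run of 3 or more identical characters down to 2."""
    return REPEAT.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Note that the result is not a dictionary word ('cooll', not 'cool'). The point is only that all lengthened variants collapse to the same string.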

Laugh Normalization

• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
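A possible sketch of laugh shrinking follows. It assumes that a laugh is a whole word made of one consonant from h/j/k and one vowel, repeated at least twice, with the occasional doubled letter as in 'hahahha'. The pattern and the name `shrink_laugh` are hypothetical and are not taken from ldig.

```python
import re

# \1 = the laugh consonant, \2 = the laugh vowel; both must stay the same
# throughout the word, so ordinary words such as "hake" are not touched.
LAUGH = re.compile(r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*\b", re.IGNORECASE)


def shrink_laugh(text: str) -> str:
    """Shrink laughs such as 'hahahha' or 'jejejeje' to two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "malo jejejeje", "kekeke", "hake"):
    print(sample, "=>", shrink_laugh(sample))
```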

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input: multi-line text
    • Replace TABs with " " and line feeds with U+0001 in it
  – Output: "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how many times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  – [correct label]\t[metadata]\t[text]

en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca    [status ID]    [datetime]    [userID]    [language of UI]    xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
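A record in this tab-separated format could be read as follows. This sketch assumes that the first field is the label, the last field is the text and everything in between is metadata, as the 'ca' example suggests. The function name `parse_record` and the sample values are hypothetical.

```python
def parse_record(line: str):
    """Split '[label]\\t[metadata...]\\t[text]' into its three parts."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    metadata = fields[1:-1]   # zero or more metadata columns
    text = fields[-1]
    return label, metadata, text


# Made-up sample values, only to show the shape of a record.
sample = "ca\t12345\t2012-05-14 10:00:00\t678\tes\txDDD no m'extranya\n"
print(parse_record(sample))
```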

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it is executed
  – For a text entered in the text area, it outputs the language probabilities, the contained features and their parameters

Estimation

• LD53 = langdetect + standard bundled profiles
• LDsm = langdetect + profiles based on twitter corpus
• As a text with maximum probability < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

    language      size    detect  correct  precision  recall  LD53   LDsm
ca  Catalan       5093    4923    4857     98.66      95.37   95.3   97.0
cs  Czech         7681    7668    7663     99.93      99.77   96.3   99.7
da  Danish        5516    5472    5310     97.04      96.27   94.5   92.4
de  German        10060   10069   10006    99.37      99.46   86.6   93.8
en  English       10162   10133   10029    98.97      98.69   88.3   95.0
es  Spanish       10244   10284   10120    98.41      98.79   91.5   96.0
fi  Finnish       7051    7038    7024     99.80      99.62   98.9   99.6
fr  French        10074   10134   10051    99.18      99.77   95.0   98.1
hu  Hungarian     4904    4892    4858     99.30      99.06   85.8   95.5
id  Indonesian    10178   10225   10160    99.36      99.82   89.7   98.9
it  Italian       10143   10205   10103    99.00      99.61   96.2   98.0
nl  Dutch         10005   9916    9858     99.42      98.53   69.5   97.4
no  Norwegian     8504    8432    8201     97.26      96.44   96.0   96.3
pl  Polish        10151   10149   10130    99.81      99.79   98.0   99.7
pt  Portuguese    10212   10201   10119    99.20      99.09   88.0   96.9
ro  Romanian      5913    5867    5850     99.71      98.93   92.8   97.4
sv  Swedish       10025   10093   9942     98.50      99.17   96.0   97.9
tr  Turkish       10308   10317   10298    99.82      99.90   97.6   99.5
vi  Vietnamese    10487   10480   10474    99.94      99.88   98.7   99.2
    total         166711          165053              99.01   92.2   97.4
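The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. This reading is inferred from the figures rather than stated on the slide. A short Python check on the Catalan row, with a hypothetical function name:

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent.

    size    : number of test tweets whose true label is the language
    detect  : number of tweets the detector assigned to the language
    correct : number of those assignments that are right
    """
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return precision, recall


p, r = precision_recall(size=5093, detect=4923, correct=4857)  # Catalan row
print(f"precision={p:.2f} recall={r:.2f}")
```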

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Use the 19-language model

    Language   size   detect  correct  precision  recall
de  German     1479   1476    1469     99.5       99.3
en  English    1505   1502    1490     99.2       99.0
es  Spanish    1562   1548    1541     99.6       98.7
fr  French     1551   1549    1540     99.4       99.3
it  Italian    1539   1531    1528     99.8       99.3
nl  Dutch      1430   1429    1424     99.7       99.6
    total      9066           8992                99.2

Estimation for Europarl Dataset

• ldig results are given only for the languages ldig supports (---- = not supported)

                          ldig             langdetect       CLD
    language      size    correct  rate    correct  rate    correct  rate
bg  Bulgarian     1000    ----     ----    988      98.8    991      99.1
cs  Czech         1000    1000     100.0   994      99.4    995      99.5
da  Danish        1000    976      97.6    968      96.8    932      93.2
de  German        1000    999      99.9    998      99.8    1000     100.0
el  Greek         1000    ----     ----    1000     100.0   1000     100.0
en  English       1000    999      99.9    996      99.6    1000     100.0
es  Spanish       1000    1000     100.0   996      99.6    989      98.9
et  Estonian      1000    ----     ----    996      99.6    998      99.8
fi  Finnish       1000    997      99.7    998      99.8    1000     100.0
fr  French        1000    999      99.9    999      99.9    992      99.2
hu  Hungarian     1000    1000     100.0   999      99.9    999      99.9
it  Italian       1000    999      99.9    999      99.9    996      99.6
lt  Lithuanian    1000    ----     ----    997      99.7    999      99.9
lv  Latvian       1000    ----     ----    999      99.9    998      99.8
nl  Dutch         1000    1000     100.0   974      97.4    995      99.5
pl  Polish        1000    998      99.8    999      99.9    997      99.7
pt  Portuguese    1000    995      99.5    996      99.6    989      98.9
ro  Romanian      1000    1000     100.0   999      99.9    998      99.8
sk  Slovak        1000    ----     ----    988      98.8    990      99.0
sl  Slovene       1000    ----     ----    976      97.6    963      96.3
sv  Swedish       1000    995      99.5    991      99.1    993      99.3
    total         21000   13957    99.7    20850    99.3    20814    99.1

Conclusions

• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with a tweet corpus
• If the corpus is maintained further, the precision will still go up
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will still go up
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 10: Short Text Language Detection with Infinity-Gram

Language Detection with Naive Bayes Classifier

• Document categorization with language labels
  – Categorize documents into English, Japanese and so on
• Use character n-grams as features
  – Unicode code point n-grams, strictly speaking
  – Assume the character encoding of the document is already known
    • Most applications know the encoding of their internal text data

Why Use n-Gram to Detect Language

• Each language has its own characters and spelling rules
  – "é" is often used in Spanish, Italian and so on, but not in English in principle
  – There are many words which start with "Z" in German, but not in English
  – There are many words which start with "C" in English, but not in German
  – The spelling "Th" is often used in English, but not in other languages

Example for the word "This" (_ = word boundary):

T h i s            ← 1-gram
_T Th hi is s_     ← 2-gram
_Th Thi his is_    ← 3-gram

          C      L      Z      Th
English   0.75   0.47   0.02   0.74
German    0.10   0.37   0.53   0.03
French    0.38   0.69   0.01   0.01
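The n-gram extraction illustrated for "This" can be sketched as below. The padding character `_` for word boundaries and the function name `char_ngrams` are assumptions made for the illustration.

```python
def char_ngrams(word: str, n: int, pad: str = "_"):
    """Return the character n-grams of a word.

    For n >= 2 the word is padded on both sides with a boundary marker,
    so that word-initial and word-final letters form their own features.
    """
    s = pad + word + pad if n > 1 else word
    return [s[i:i + n] for i in range(len(s) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```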

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained with Wikipedia abstracts
  – Wide support, including Asian languages
  – Adopted by Apache Solr

Estimation with News Text

• Test on news text crawled from the web in 49 languages

       Language              size   correct  accuracy (%)
af     Afrikaans             200    199      99.50
ar     Arabic                200    200      100.00
bg     Bulgarian             200    200      100.00
bn     Bengali               200    200      100.00
cs     Czech                 200    200      100.00
da     Danish                200    179      89.50
de     German                200    200      100.00
el     Greek                 200    200      100.00
en     English               200    200      100.00
es     Spanish               200    200      100.00
fa     Persian               200    200      100.00
fi     Finnish               200    200      100.00
fr     French                200    200      100.00
gu     Gujarati              200    200      100.00
he     Hebrew                200    200      100.00
hi     Hindi                 200    200      100.00
hr     Croatian              200    200      100.00
hu     Hungarian             200    200      100.00
id     Indonesian            200    200      100.00
it     Italian               200    200      100.00
ja     Japanese              200    200      100.00
kn     Kannada               200    200      100.00
ko     Korean                200    200      100.00
mk     Macedonian            200    200      100.00
ml     Malayalam             200    200      100.00
mr     Marathi               200    200      100.00
ne     Nepali                200    200      100.00
nl     Dutch                 200    200      100.00
no     Norwegian             200    199      99.50
pa     Punjabi               200    200      100.00
pl     Polish                200    200      100.00
pt     Portuguese            200    200      100.00
ro     Romanian              200    200      100.00
ru     Russian               200    200      100.00
sk     Slovak                200    200      100.00
so     Somali                200    200      100.00
sq     Albanian              200    200      100.00
sv     Swedish               200    200      100.00
sw     Swahili               200    200      100.00
ta     Tamil                 200    200      100.00
te     Telugu                200    200      100.00
th     Thai                  200    200      100.00
tl     Tagalog               200    200      100.00
tr     Turkish               200    200      100.00
uk     Ukrainian             200    200      100.00
ur     Urdu                  200    200      100.00
vi     Vietnamese            200    200      100.00
zh-cn  Simplified Chinese    200    200      100.00
zh-tw  Traditional Chinese   200    200      100.00
       total                 9800   9777     99.77

Estimation with Europarl datasets

• Test with 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

    language      size    correct  accuracy (%)
bg  Bulgarian     1000    988      98.8
cs  Czech         1000    994      99.4
da  Danish        1000    968      96.8
de  German        1000    998      99.8
el  Greek         1000    1000     100.0
en  English       1000    996      99.6
es  Spanish       1000    996      99.6
et  Estonian      1000    996      99.6
fi  Finnish       1000    998      99.8
fr  French        1000    999      99.9
hu  Hungarian     1000    999      99.9
it  Italian       1000    999      99.9
lt  Lithuanian    1000    997      99.7
lv  Latvian       1000    999      99.9
nl  Dutch         1000    974      97.4
pl  Polish        1000    999      99.9
pt  Portuguese    1000    996      99.6
ro  Romanian      1000    999      99.9
sk  Slovak        1000    988      98.8
sl  Slovene       1000    976      97.6
sv  Swedish       1000    991      99.1
    total         21000   20850    99.3

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy for tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – regard ms (Malay) as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Estimated on the 15 languages in our tweet corpus that Tika supports

    language      LD     CLD    Tika
ca  Catalan       95.3   93.0   83.8
cs  Czech         96.3   96.6   ----
da  Danish        94.5   90.7   58.7
de  German        86.6   96.8   73.1
en  English       88.3   97.4   54.7
es  Spanish       91.5   90.5   44.4
fi  Finnish       98.9   99.4   94.8
fr  French        95.0   94.5   67.4
hu  Hungarian     85.8   89.0   76.2
id  Indonesian    89.7   92.8   ----
it  Italian       96.2   93.8   87.1
nl  Dutch         69.5   93.2   65.0
no  Norwegian     96.0   74.9   68.6
pl  Polish        98.0   97.8   88.8
pt  Portuguese    88.0   88.6   47.4
ro  Romanian      92.8   96.1   82.6
sv  Swedish       96.0   96.4   75.6
tr  Turkish       97.6   97.4   ----
vi  Vietnamese    98.7   98.9   ----
    total         92.2   93.8   70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection difficult? (1)

• A tweet is too short to extract 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult? (2)

• Tweets are too noisy
  – Expressions that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be smaller under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)

The letter k isn't used in Italian

Motivation to Detect Short Text Language

• There are many small chunks of text besides twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal

• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram Logistic Regression)
• and a twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels

Proposal Method


How to increase features from 3-grams

• The larger n, the more features
• Maximum at n=∞, that is, all substrings
  – But the number of features is of order O(T^2)

     number of n-grams
n    freq≥1    freq≥2    freq≥10
1    79        72        57
2    1896      1533      902
3    15970     10369     4525
4    64966     33941     10534
5    167543    69719     15538
6    323749    107861    18970
7    524634    142954    21093
8    760719    171995    22159
9    921361    193995    22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Store features in a TRIE for fast prediction

Maximal Substring (1)

• Define a containment relation (partial order) among non-empty substrings, e.g. for "abracadabra"
  – "ra" ⊂ "bra" ⇔ every "ra" occurs as a substring of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
• Strictly speaking, the relation is defined including the position within the substring

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
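For a small string the maximal substrings can be found by brute force. The sketch below uses a characterization that is equivalent for this purpose: a substring is maximal if extending it by one character on either side always lowers its frequency. It is quadratic in space and meant only as an illustration. The linear-time construction uses suffix arrays, and the function name `maximal_substrings` is hypothetical.

```python
from collections import Counter


def maximal_substrings(text: str):
    """Brute-force maximal substrings of a short text."""
    n = len(text)
    # Frequency of every substring (overlapping occurrences are counted).
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        extensions = [ch + sub for ch in alphabet] + [sub + ch for ch in alphabet]
        # Maximal: every one-character extension occurs strictly less often.
        if all(counts.get(ext, 0) < freq for ext in extensions):
            result.append(sub)
    return sorted(result, key=len)


print(maximal_substrings("abracadabra"))
```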

Maximal Substring and Infinity-Gram

• Frequencies of substrings that are in a containment relation are always equal
• In a model with a linear combination of features, the common feature values can be factored out
• Logistic regression with maximal substrings is equivalent to the one with infinity-grams
• Although the equivalence collapses on the test set, we assume that it can be approximated with a sufficiently large training set

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What's Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered with these
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Almost no present-day languages use these
U+A720-A7FF    Latin Extended-D              (same as above)
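The block table translates directly into a range check. This is a minimal sketch with hypothetical names `LATIN_BLOCKS` and `in_latin_blocks`. Note that Basic Latin also contains digits and punctuation, so the check tells which block a character belongs to, not whether it is a letter.

```python
# The 9 Unicode blocks listed in the table above, as inclusive ranges.
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch: str) -> bool:
    """True if the character's code point lies in one of the 9 blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)


print([(ch, in_latin_blocks(ch)) for ch in "ąẵяさ"])
```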

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  – Samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of those in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from the ones labeled 'fr'
  – Recover French tweets from the ones not labeled 'fr'

How to annotate

Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn

Created Corpus

• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus creation

    language      training  test
ca  Catalan       9089      5082
cs  Czech         9082      7682
da  Danish        7388      5524
de  German        44448     10065
en  English       44520     10168
es  Spanish       44118     10265
fi  Finnish       8087      7050
fr  French        44339     10098
hu  Hungarian     10030     4904
id  Indonesian    44722     10181
it  Italian       43366     10152
nl  Dutch         44682     10007
no  Norwegian     10124     8496
pl  Polish        16771     10152
pt  Portuguese    44215     10208
ro  Romanian      10021     5911
sv  Swedish       44054     10032
tr  Turkish       44703     10308
vi  Vietnamese    15030     10488
    total         538789    166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still gets at most 98% accuracy
• We guess it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size

• The number of tweets in each language is heavily biased
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice, this is approximated by copying the data an integer number of times and sampling the rest without replacement

[Chart legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
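The approximation described above (whole copies of the data plus a remainder sampled without replacement) might look like this in Python. The function name `level_out` and the toy data are hypothetical.

```python
import random


def level_out(items, target_size, rng):
    """Grow a language's data to target_size.

    Copies the whole list an integer number of times, then fills the
    remainder by sampling without replacement, so every item appears
    either `copies` or `copies + 1` times.
    """
    copies, rest = divmod(target_size, len(items))
    return items * copies + rng.sample(items, rest)


rng = random.Random(0)                      # fixed seed for reproducibility
small = ["t1", "t2", "t3", "t4"]            # toy stand-in for a small language
balanced = level_out(small, 10, rng)
print(len(balanced), sorted(balanced))
```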

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]


Page 11: Short Text Language Detection with Infinity-Gram

Why Use n-Gram to Detect Language

• Each language has its own characters and spelling rules
  – "é" is often used in Spanish, Italian and so on, but in principle not in English
  – Many words start with "Z" in German, but not in English
  – Many words start with "C" in English, but not in German
  – The spelling "Th" is often used in English, but not in other languages

Character n-grams of "This" (_ marks a word boundary):

  1-gram: T, h, i, s
  2-gram: _T, Th, hi, is, s_
  3-gram: _Th, Thi, his, is_

           C     L     Z     Th
English  0.75  0.47  0.02  0.74
German   0.10  0.37  0.53  0.03
French   0.38  0.69  0.01  0.01
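The n-gram split of "This" shown on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `char_ngrams` and the `_` padding character are assumptions made here, not part of langdetect or ldig.

```python
def char_ngrams(word, n, pad="_"):
    """Return the character n-grams of `word`.

    For n >= 2 the word is padded with `pad` on both sides so that
    n-grams at the word boundary are also produced (an assumption
    that matches the 2-gram and 3-gram rows of the slide).
    """
    padded = word if n == 1 else pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("This", n))
```

Counting how often each n-gram occurs per language gives the kind of per-language statistics shown in the table for "C", "L", "Z" and "Th".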

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr

Evaluation with News Text

• Tested on news text crawled from the web in 49 languages

code   language             size  correct  accuracy
af     Afrikaans             200     199    99.50%
ar     Arabic                200     200   100.00%
bg     Bulgarian             200     200   100.00%
bn     Bengali               200     200   100.00%
cs     Czech                 200     200   100.00%
da     Danish                200     179    89.50%
de     German                200     200   100.00%
el     Greek                 200     200   100.00%
en     English               200     200   100.00%
es     Spanish               200     200   100.00%
fa     Persian               200     200   100.00%
fi     Finnish               200     200   100.00%
fr     French                200     200   100.00%
gu     Gujarati              200     200   100.00%
he     Hebrew                200     200   100.00%
hi     Hindi                 200     200   100.00%
hr     Croatian              200     200   100.00%
hu     Hungarian             200     200   100.00%
id     Indonesian            200     200   100.00%
it     Italian               200     200   100.00%
ja     Japanese              200     200   100.00%
kn     Kannada               200     200   100.00%
ko     Korean                200     200   100.00%
mk     Macedonian            200     200   100.00%
ml     Malayalam             200     200   100.00%
mr     Marathi               200     200   100.00%
ne     Nepali                200     200   100.00%
nl     Dutch                 200     200   100.00%
no     Norwegian             200     199    99.50%
pa     Punjabi               200     200   100.00%
pl     Polish                200     200   100.00%
pt     Portuguese            200     200   100.00%
ro     Romanian              200     200   100.00%
ru     Russian               200     200   100.00%
sk     Slovak                200     200   100.00%
so     Somali                200     200   100.00%
sq     Albanian              200     200   100.00%
sv     Swedish               200     200   100.00%
sw     Swahili               200     200   100.00%
ta     Tamil                 200     200   100.00%
te     Telugu                200     200   100.00%
th     Thai                  200     200   100.00%
tl     Tagalog               200     200   100.00%
tr     Turkish               200     200   100.00%
uk     Ukrainian             200     200   100.00%
ur     Urdu                  200     200   100.00%
vi     Vietnamese            200     200   100.00%
zh-cn  Simplified Chinese    200     200   100.00%
zh-tw  Traditional Chinese   200     200   100.00%
total                       9800    9777    99.77%

Evaluation with Europarl Datasets

• Tested on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

code  language     size   correct  accuracy (%)
bg    Bulgarian    1000      988      98.8
cs    Czech        1000      994      99.4
da    Danish       1000      968      96.8
de    German       1000      998      99.8
el    Greek        1000     1000     100.0
en    English      1000      996      99.6
es    Spanish      1000      996      99.6
et    Estonian     1000      996      99.6
fi    Finnish      1000      998      99.8
fr    French       1000      999      99.9
hu    Hungarian    1000      999      99.9
it    Italian      1000      999      99.9
lt    Lithuanian   1000      997      99.7
lv    Latvian      1000      999      99.9
nl    Dutch        1000      974      97.4
pl    Polish       1000      999      99.9
pt    Portuguese   1000      996      99.6
ro    Romanian     1000      999      99.9
sk    Slovak       1000      988      98.8
sl    Slovene      1000      976      97.6
sv    Swedish      1000      991      99.1
total             21000    20850      99.3

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):

code  language      LD    CLD   Tika
ca    Catalan      95.3  93.0  83.8
cs    Czech        96.3  96.6  ----
da    Danish       94.5  90.7  58.7
de    German       86.6  96.8  73.1
en    English      88.3  97.4  54.7
es    Spanish      91.5  90.5  44.4
fi    Finnish      98.9  99.4  94.8
fr    French       95.0  94.5  67.4
hu    Hungarian    85.8  89.0  76.2
id    Indonesian   89.7  92.8  ----
it    Italian      96.2  93.8  87.1
nl    Dutch        69.5  93.2  65.0
no    Norwegian    96.0  74.9  68.6
pl    Polish       98.0  97.8  88.8
pt    Portuguese   88.0  88.6  47.4
ro    Romanian     92.8  96.1  82.6
sv    Swedish      96.0  96.4  75.6
tr    Turkish      97.6  97.4  ----
vi    Vietnamese   98.7  98.9  ----
total              92.2  93.8  70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – Number of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection Difficult? (1)

• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection Difficult? (2)

• Tweets are too noisy
  – Spellings that violate the language's orthography appear often
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be low under a normal language model

OMG   Oh My God
LOL   Laughing Out Loud
LMAO  Laughing My Ass Off
F4F   Follow for Follow
MDR   Mort de Rire (French)
TKT   Ne t'inquiète pas (French)
u     you
ur    your
4     for
i0u   I love you
k     che (Italian)
anke  anche (Italian)

The letter k isn't used in Italian

Motivation to Detect Short Text Language

• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for text that mixes multiple languages
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal

• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99%+ accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels

Proposed Method

How to increase features from 3-grams

• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But their number is of order O(T²)

         Number of distinct n-grams (cumulative)
n        freq≥1   freq≥2   freq≥10
1            79       72        57
2          1896     1533       902
3         15970    10369      4525
4         64966    33941     10534
5        167543    69719     15538
6        323749   107861     18970
7        524634   142954     21093
8        760719   171995     22159
9        921361   193995     22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Using maximal substrings yields an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (a semi-order) among non-empty substrings, using "abracadabra" as an example
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is inside an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  – Strictly, the definition also takes the position within the containing substring into account

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"
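The definition can be checked by brute force. The sketch below uses a common equivalent criterion as its working assumption: a substring is maximal when every extension by one character, to the left or to the right, occurs strictly fewer times than the substring itself. It is quadratic in the text length and for illustration only; the function name `maximal_substrings` is made up here and is not the linear-time method of [Okanohara+ 09].

```python
from collections import Counter


def maximal_substrings(text):
    """Brute-force maximal substrings of `text` (illustration only)."""
    # Frequency of every non-empty substring: O(T^2) entries.
    counts = Counter(
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # All one-character extensions on the left and on the right.
        extensions = [c + sub for c in alphabet] + [sub + c for c in alphabet]
        # Maximal: no extension keeps the same frequency.
        if all(counts.get(ext, 0) < freq for ext in extensions):
            result.append(sub)
    return sorted(result, key=len)


print(maximal_substrings("abracadabra"))
```

For "abracadabra" this yields the three maximal substrings named on the slide.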

via http://d.hatena.ne.jp/nokuno/20120203/1328237067

Maximal Substring and Infinity-Gram

• Substrings related by containment (in the same equivalence class) always have equal frequencies
• In a model that is a linear combination of features, features whose values are always equal can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams

Although the equivalence may not hold on the test set, we assume that a sufficiently large training set approximates it

Extended Suffix Array

• An Extended Suffix Array (ESA) consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to suffixes with L > 0 whose BWT characters are of more than one type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

via [Okanohara+ 09]
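To make the three arrays concrete, here is a naive construction in Python. It is a sketch under stated assumptions: it sorts suffixes directly, so it is far from the linear-time construction that esaxx provides, and the function name `build_esa` is hypothetical. The input is expected to end with a unique terminator such as "$".

```python
def build_esa(text):
    """Naively build the suffix array, LCP array and BWT of `text`."""
    n = len(text)
    # Suffix array: start positions of suffixes in lexicographic order.
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT: the character preceding each suffix (wraps around for suffix 0).
    bwt = "".join(text[i - 1] for i in sa)
    # LCP: length of the common prefix of each suffix and its predecessor.
    lcp = [0] * n
    for k in range(1, n):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        lcp[k] = length
    return sa, lcp, bwt


sa, lcp, bwt = build_esa("banana$")
print(sa, lcp, bwt)
```

Positions where the LCP value is positive and the neighbouring BWT characters differ are the candidates the slide describes as repeated maximal substrings.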

Corpus and Normalization

Target Languages

• Limit the character types to detect
  – In short text detection, mixed text can be split by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – More than 25 languages have over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• Latin letters are assigned to 9 code blocks in Unicode

Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered with these
U+0100-017F   Latin Extended-A              (same as above)
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Hardly used by any present-day language
U+A720-A7FF   Latin Extended-D              (same as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code chart of Latin letters. Legend: used for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
  – Remove non-French tweets from the ones labeled 'fr'
  – Recover French tweets from the ones not labeled 'fr'

How to Annotate

Guides for Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish, in that order

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe on Catalan corpus creation

code  language     training    test
ca    Catalan          9089    5082
cs    Czech            9082    7682
da    Danish           7388    5524
de    German          44448   10065
en    English         44520   10168
es    Spanish         44118   10265
fi    Finnish          8087    7050
fr    French          44339   10098
hu    Hungarian       10030    4904
id    Indonesian      44722   10181
it    Italian         43366   10152
nl    Dutch           44682   10007
no    Norwegian       10124    8496
pl    Polish          16771   10152
pt    Portuguese      44215   10208
ro    Romanian        10021    5911
sv    Swedish         44054   10032
tr    Turkish         44703   10308
vi    Vietnamese      15030   10488
total                538789  166773

Simple Language Detection

• A language detector can be built from the maximal substring model and the twitter corpus
  – But it still reaches at most 98% accuracy
• We guess it is necessary to reduce biases
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size

• The number of tweets differs hugely from language to language
• Level them out by sampling with replacement from each language, up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
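The approximation in the last bullet can be sketched as follows. The function name `level_out` and its arguments are hypothetical; this illustrates the idea and is not ldig's actual code.

```python
import random


def level_out(samples, target_size, rng=random):
    """Grow `samples` (a list) to `target_size` items.

    The list is copied a whole number of times and the remainder is
    drawn without replacement, which approximates sampling with
    replacement while keeping every original item represented.
    """
    copies, rest = divmod(target_size, len(samples))
    return samples * copies + rng.sample(samples, rest)


small_corpus = ["tweet a", "tweet b", "tweet c"]
print(level_out(small_corpus, 8, random.Random(0)))
```

With 3 items and a target of 8, every item appears at least twice and two of them appear a third time.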

[Figure: tweet volume by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• So convert to lower case, excluding 'I'

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
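A minimal sketch of "lower case excluding 'I'" in Python; the function name is an assumption made for this example.

```python
def lower_except_capital_i(text):
    """Lowercase every character except capital I (U+0049).

    Capital I is left untouched because its lower case is "i" in most
    languages but dotless "ı" (U+0131) in Turkish and Azerbaijani.
    """
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ÇOK DOĞRU Ich"))
```

Keeping I as a separate symbol lets the model learn its language-specific behaviour instead of forcing one mapping on all languages.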

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, twitter and Wikipedia
• The 2 kinds have the same design in some fonts
  – Indistinguishable

ș (U+0219)   ş (U+015F)
ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: 's/t with comma' was provided by ISO8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)

's/t with cedilla' is used on an advertisement board in Bucharest

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between glyph variants of the Japanese character 'SA': さ さ
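Mapping the comma-below letters onto the cedilla forms is a plain character translation. The sketch below is one way to write it; the table and function names are chosen for illustration.

```python
# Map "s/t with comma below" to the more common "s/t with cedilla".
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Înc\u00e2ntat de cuno\u0219tin\u021b\u0103"))
```

After this step both spellings of a Romanian word produce the same features.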

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses ی (U+06cc, Farsi yeh) only
  – News uses ي (U+064a, Arabic yeh) only
• U+064a is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06cc
  – As 'yeh' is used very often in both languages, detection fails for almost all Persian text
• Regard U+06cc as U+064a

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents such as news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• There are 2 representations of vowels with tones
  1. Use U+1ea0 - U+1ef9
     • ẵ = U+1eb5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – The two are used about half and half in news and tweets
• Normalize 2 into 1
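In Python, Unicode NFC normalization composes a base vowel and a combining tone mark into the precomposed code point, which has the effect of normalizing representation 2 into representation 1. Using NFC here is an assumption of this sketch; the slide does not say how the talk's implementation performs the mapping.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining marks into single code points."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # ă followed by COMBINING TILDE
composed = compose_vietnamese(combined)
print([f"U+{ord(ch):04X}" for ch in composed])
```

The example from the slide, U+0103 U+0303, becomes the single code point U+1EB5.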

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji used in personal names
  – Common frequent characters are too strong
    • e.g. a text which contains "的" tends to be detected as Traditional Chinese
• Because Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Uses tf-idf on Wikipedia and Google News
    • K=50 (size of the ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japan and China
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for personal names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – face marks written with letters, like XD and :p
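The removals listed above map naturally onto a handful of regular expressions. The patterns below are rough assumptions written for this sketch (real URLs, mentions and face marks are more varied), and the function name `strip_twitter_noise` is hypothetical.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based face marks such as XD, xDDD, :p, :D (a rough approximation).
FACE_MARK = re.compile(r"(?<!\w)[xX:;=]-?[dDpP]+(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and simple face marks."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE_MARK):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(strip_twitter_noise("RT @shuyo http://example.com/x #nlp XD so cool :p"))
```

Only the words that carry language information remain for feature extraction.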

Normalization for twitter-Specific Representations

• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink the run to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • Japanese does have such words: "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
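Case 2 is a one-line regular expression. The sketch below treats "3 or more" identical letters as a run to shrink; that threshold and the names used are assumptions of this example.

```python
import re

# A letter (no digits or underscore) followed by 2+ copies of itself.
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical letters to 2."""
    return REPEATED_LETTER.sub(r"\1\1", text)


print(shrink_repeats("coooollll"))
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Unlike the dictionary approach of Case 1, this does not recover "cool", but it maps every lengthened variant of a word onto a single form.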

Laugh Normalization

• Each language has its own kinds of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
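One possible way to shrink laughs is a regular expression that matches alternating runs of one consonant and one vowel. The pattern below, the choice of the consonants h, j and k, and the lowercasing of the result are all assumptions of this sketch.

```python
import re

# A consonant run and a vowel run, repeated at least once more,
# optionally followed by a trailing consonant run (e.g. "...HAH").
LAUGH = re.compile(r"\b([hjk])+([aeiou])+(?:\1+\2+)+\1*\b", re.IGNORECASE)


def shrink_laugh(text):
    """Replace laughs like 'hahahha' or 'Jajajaja' with two repetitions."""
    def two_repetitions(match):
        return (match.group(1) + match.group(2)).lower() * 2
    return LAUGH.sub(two_repetitions, text)


print(shrink_laugh("hahahha"))
print(shrink_laugh("Tafil con eso Jajajajajajaja"))
```

The laugh keeps its language-specific consonant and vowel (haha, jaja, keke), which remains a useful hint, while its length no longer matters.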

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  – Output lines have the form "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learns the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: number of times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
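A line in this format can be split on TABs. The sketch below assumes that the label is the first field, the text is the last field, and anything in between is metadata (possibly several fields, possibly none); the function name `parse_line` is hypothetical and the sample values are made up.

```python
def parse_line(line):
    """Split '[label]\\t[meta data]\\t[text]' into its parts.

    Returns (label, metadata_fields, text). Metadata is whatever lies
    between the first and the last TAB-separated field.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return fields[0], fields[1:-1], fields[-1]


label, meta, text = parse_line("ca\t12345\t2012-05-14\tuser1\ten\tno m'estranya\n")
print(label, meta, text)
```

A line with only a label and a text yields an empty metadata list.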

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port]/ after it has started
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Evaluation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the twitter corpus

A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of detect is less than the sum of size

code  language      size   detect  correct  precision  recall  LD53  LDsm
ca    Catalan       5093    4923    4857     98.66     95.37   95.3  97.0
cs    Czech         7681    7668    7663     99.93     99.77   96.3  99.7
da    Danish        5516    5472    5310     97.04     96.27   94.5  92.4
de    German       10060   10069   10006     99.37     99.46   86.6  93.8
en    English      10162   10133   10029     98.97     98.69   88.3  95.0
es    Spanish      10244   10284   10120     98.41     98.79   91.5  96.0
fi    Finnish       7051    7038    7024     99.80     99.62   98.9  99.6
fr    French       10074   10134   10051     99.18     99.77   95.0  98.1
hu    Hungarian     4904    4892    4858     99.30     99.06   85.8  95.5
id    Indonesian   10178   10225   10160     99.36     99.82   89.7  98.9
it    Italian      10143   10205   10103     99.00     99.61   96.2  98.0
nl    Dutch        10005    9916    9858     99.42     98.53   69.5  97.4
no    Norwegian     8504    8432    8201     97.26     96.44   96.0  96.3
pl    Polish       10151   10149   10130     99.81     99.79   98.0  99.7
pt    Portuguese   10212   10201   10119     99.20     99.09   88.0  96.9
ro    Romanian      5913    5867    5850     99.71     98.93   92.8  97.4
sv    Swedish      10025   10093    9942     98.50     99.17   96.0  97.9
tr    Turkish      10308   10317   10298     99.82     99.90   97.6  99.5
vi    Vietnamese   10487   10480   10474     99.94     99.88   98.7  99.2
total             166711       -  165053         -     99.01   92.2  97.4
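The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. That reading is inferred from the numbers rather than stated on the slide; the short sketch below, with a made-up function name, checks it on the Catalan row.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    size:    number of test texts whose true label is this language
    detect:  number of texts the detector labeled as this language
    correct: number of those detections that were right
    """
    return 100.0 * correct / detect, 100.0 * correct / size


precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same formulas reproduce the other rows, and the total row's 99.01 equals 165053 / 166711.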

Evaluation on the LIGA Dataset

• Evaluated on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model

code  language   size  detect  correct  precision  recall
de    German     1479    1476    1469     99.5      99.3
en    English    1505    1502    1490     99.2      99.0
es    Spanish    1562    1548    1541     99.6      98.7
fr    French     1551    1549    1540     99.4      99.3
it    Italian    1539    1531    1528     99.8      99.3
nl    Dutch      1430    1429    1424     99.7      99.6
total            9066       -    8992        -      99.2

Evaluation on the Europarl Dataset

ldig results are given only for the languages ldig supports ("-" = not supported)

                           ldig           langdetect         CLD
code  language     size  correct  rate   correct  rate   correct  rate
bg    Bulgarian    1000      -      -       988   98.8      991   99.1
cs    Czech        1000   1000  100.0       994   99.4      995   99.5
da    Danish       1000    976   97.6       968   96.8      932   93.2
de    German       1000    999   99.9       998   99.8     1000  100.0
el    Greek        1000      -      -      1000  100.0     1000  100.0
en    English      1000    999   99.9       996   99.6     1000  100.0
es    Spanish      1000   1000  100.0       996   99.6      989   98.9
et    Estonian     1000      -      -       996   99.6      998   99.8
fi    Finnish      1000    997   99.7       998   99.8     1000  100.0
fr    French       1000    999   99.9       999   99.9      992   99.2
hu    Hungarian    1000   1000  100.0       999   99.9      999   99.9
it    Italian      1000    999   99.9       999   99.9      996   99.6
lt    Lithuanian   1000      -      -       997   99.7      999   99.9
lv    Latvian      1000      -      -       999   99.9      998   99.8
nl    Dutch        1000   1000  100.0       974   97.4      995   99.5
pl    Polish       1000    998   99.8       999   99.9      997   99.7
pt    Portuguese   1000    995   99.5       996   99.6      989   98.9
ro    Romanian     1000   1000  100.0       999   99.9      998   99.8
sk    Slovak       1000      -      -       988   98.8      990   99.0
sl    Slovene      1000      -      -       976   97.6      963   96.3
sv    Swedish      1000    995   99.5       991   99.1      993   99.3
total             21000  13957   99.7     20850   99.3    20814   99.1

Conclusions

• A language detector using the maximal substring model
  – Over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, precision will rise further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, precision will rise further
  – Open question: how to add and train metadata at low cost
• We want to shrink the model without loss of precision
  – It is too large for applications (>100MB)

References

• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll!!!!!!!!!!!!!! Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 12: Short Text Language Detection with Infinity-Gram

language-detection (langdetect) (Nakatani 2010)

• Language detection library for Java
  – http://code.google.com/p/language-detection/
  – Apache License 2.0
  – Character 3-gram + Bayesian filter
  – Various normalizations + feature sampling
• Over 99% precision for 53 languages
  – Trained on Wikipedia abstracts
  – Wide language support, including Asian languages
  – Adopted by Apache Solr

Estimation with News Text

• Test on news text crawled from the web in 49 languages

language                    size  correct (accuracy)
af    Afrikaans             200   199 (99.50%)
ar    Arabic                200   200 (100.00%)
bg    Bulgarian             200   200 (100.00%)
bn    Bengali               200   200 (100.00%)
cs    Czech                 200   200 (100.00%)
da    Danish                200   179 (89.50%)
de    German                200   200 (100.00%)
el    Greek                 200   200 (100.00%)
en    English               200   200 (100.00%)
es    Spanish               200   200 (100.00%)
fa    Persian               200   200 (100.00%)
fi    Finnish               200   200 (100.00%)
fr    French                200   200 (100.00%)
gu    Gujarati              200   200 (100.00%)
he    Hebrew                200   200 (100.00%)
hi    Hindi                 200   200 (100.00%)
hr    Croatian              200   200 (100.00%)
hu    Hungarian             200   200 (100.00%)
id    Indonesian            200   200 (100.00%)
it    Italian               200   200 (100.00%)
ja    Japanese              200   200 (100.00%)
kn    Kannada               200   200 (100.00%)
ko    Korean                200   200 (100.00%)
mk    Macedonian            200   200 (100.00%)
ml    Malayalam             200   200 (100.00%)
mr    Marathi               200   200 (100.00%)
ne    Nepali                200   200 (100.00%)
nl    Dutch                 200   200 (100.00%)
no    Norwegian             200   199 (99.50%)
pa    Punjabi               200   200 (100.00%)
pl    Polish                200   200 (100.00%)
pt    Portuguese            200   200 (100.00%)
ro    Romanian              200   200 (100.00%)
ru    Russian               200   200 (100.00%)
sk    Slovak                200   200 (100.00%)
so    Somali                200   200 (100.00%)
sq    Albanian              200   200 (100.00%)
sv    Swedish               200   200 (100.00%)
sw    Swahili               200   200 (100.00%)
ta    Tamil                 200   200 (100.00%)
te    Telugu                200   200 (100.00%)
th    Thai                  200   200 (100.00%)
tl    Tagalog               200   200 (100.00%)
tr    Turkish               200   200 (100.00%)
uk    Ukrainian             200   200 (100.00%)
ur    Urdu                  200   200 (100.00%)
vi    Vietnamese            200   200 (100.00%)
zh-cn Simplified Chinese    200   200 (100.00%)
zh-tw Traditional Chinese   200   200 (100.00%)
total                       9800  9777 (99.77%)

Estimation with Europarl datasets

• Test on 1000 samples for each language from the Europarl Parallel Corpus
  – from the proceedings of the European Parliament
  – http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language         size   correct  accuracy (%)
bg Bulgarian     1000   988      98.8
cs Czech         1000   994      99.4
da Danish        1000   968      96.8
de German        1000   998      99.8
el Greek         1000   1000     100.0
en English       1000   996      99.6
es Spanish       1000   996      99.6
et Estonian      1000   996      99.6
fi Finnish       1000   998      99.8
fr French        1000   999      99.9
hu Hungarian     1000   999      99.9
it Italian       1000   999      99.9
lt Lithuanian    1000   997      99.7
lv Latvian       1000   999      99.9
nl Dutch         1000   974      97.4
pl Polish        1000   999      99.9
pt Portuguese    1000   996      99.6
ro Romanian      1000   999      99.9
sk Slovak        1000   988      98.8
sl Slovene       1000   976      97.6
sv Swedish       1000   991      99.1
total            21000  20850    99.3

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector/
  – ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org/
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy (%):

language         LD     CLD    Tika
ca Catalan       95.3   93.0   83.8
cs Czech         96.3   96.6   ----
da Danish        94.5   90.7   58.7
de German        86.6   96.8   73.1
en English       88.3   97.4   54.7
es Spanish       91.5   90.5   44.4
fi Finnish       98.9   99.4   94.8
fr French        95.0   94.5   67.4
hu Hungarian     85.8   89.0   76.2
id Indonesian    89.7   92.8   ----
it Italian       96.2   93.8   87.1
nl Dutch         69.5   93.2   65.0
no Norwegian     96.0   74.9   68.6
pl Polish        98.0   97.8   88.8
pt Portuguese    88.0   88.6   47.4
ro Romanian      92.8   96.1   82.6
sv Swedish       96.0   96.4   75.6
tr Turkish       97.6   97.4   ----
vi Vietnamese    98.7   98.9   ----
total            92.2   93.8   70.0

Chromium Compact Language Detection (CLD)

• Port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector/
  – Implemented in C++, with a Python binding
  – # of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection difficult? (1)

• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95-98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult? (2)

• Tweets are too noisy
  – Spellings that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be lower under a normal language model

Examples:
OMG   Oh My God
LOL   Laughing Out Loud
LMAO  Laughing My Ass Off
F4F   Follow for Follow
MDR   Mort de Rire (French)
TKT   Ne t'Inquiète Pas (French)
u     you
ur    your
4     for
i0u   I love you
k     che (Italian)
anke  anche (Italian)

The letter k isn't used in Italian

Motivation to Detect Short Text Language

• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue tracker of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed-language text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal

• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram logistic regression)
• and a twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels

Proposed Method

How to increase features from 3-grams

• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But the number of substrings is of order O(T^2)

# of n-grams (cumulative over lengths up to n):

n   freq≧1    freq≧2    freq≧10
1   79        72        57
2   1896      1533      902
3   15970     10369     4525
4   64966     33941     10534
5   167543    69719     15538
6   323749    107861    18970
7   524634    142954    21093
8   760719    171995    22159
9   921361    193995    22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
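As a rough illustration of how quickly the feature count grows with n, the following Python sketch counts the distinct substrings of length at most n that occur at least a given number of times. The function name and the toy input are illustrative assumptions; the figures in the table come from the tweet corpus, not from this code.

```python
from collections import Counter


def count_ngrams_up_to(text, max_n, min_freq=1):
    """Count distinct substrings of length <= max_n occurring >= min_freq times."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return sum(1 for c in counts.values() if c >= min_freq)


# Toy input; a real corpus would be the concatenated, normalized tweets.
for n in (1, 2, 3):
    print(n, count_ngrams_up_to("abracadabra", n), count_ngrams_up_to("abracadabra", n, min_freq=2))
```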

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a trie for fast prediction
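A minimal sketch of prediction with such a model, assuming a hypothetical `weights` dictionary that maps a feature substring to one weight per language. The weights below are toy values, not trained parameters, and a plain dictionary scan stands in for the trie lookup mentioned above.

```python
import math


def predict(text, weights, labels):
    """Softmax over the summed weights of every substring of text found in weights."""
    scores = [0.0] * len(labels)
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            w = weights.get(text[i:j])
            if w is not None:
                for k, value in enumerate(w):
                    scores[k] += value
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


labels = ["nl", "de"]
weights = {"ik ": [2.0, -1.0], "ich": [-1.0, 2.0]}  # toy values (assumption)
probs = predict("ik kan er nooit tegen", weights, labels)
print(probs)
```

The double loop visits all O(T^2) substrings of the input, which is acceptable for a tweet; a trie would allow abandoning a start position as soon as no stored feature continues.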

Maximal Substring (1)

• Define a containment relation (a partial order) among non-empty substrings. Example text: "abracadabra"
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

Strictly speaking, the definition also takes into account the position within the substring.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called the maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
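The definition can be checked on the example with a brute-force Python sketch: a substring is treated as maximal when extending it by one character on either side always lowers its occurrence count. This is only an illustration under that reading of the definition; it is quadratic in the text length and is not the linear-time construction used in practice.

```python
from collections import Counter


def maximal_substrings(text):
    """Return substrings whose every one-character extension occurs less often."""
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for s, c in counts.items():
        extendable = any(
            counts.get(ch + s, 0) == c or counts.get(s + ch, 0) == c
            for ch in alphabet
        )
        if not extendable:
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))  # ['a', 'abra', 'abracadabra']
```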

Maximal Substring and Infinity-Gram

• Substrings that are in a containment relationship always have equal frequencies
• In a model with a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams

Although the equivalence breaks down on the test set, we assume that it is approximated well by a sufficiently large training set.

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than one character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

via [Okanohara+ 09]
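To make the three arrays concrete, here is a naive Python sketch that builds them for a toy string. It sorts suffixes directly, so it is far from linear time, and it is not how esaxx works internally; the `$` terminator is an assumption of this sketch, chosen because it sorts before the letters.

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] is the common prefix length of suffixes sa[k-1] and sa[k]; lcp[0] = 0."""
    def common(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n
    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def bwt(text, sa):
    """Character preceding each sorted suffix (wrapping around for position 0)."""
    return "".join(text[i - 1] for i in sa)


text = "abracadabra$"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(sa, lcp, bwt(text, sa))
```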

Corpus and Normalization

Target Languages

• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                         Supplement
U+0000-007F   Basic Latin                  ASCII
U+0080-00FF   Latin-1 Supplement           Most languages are covered with these
U+0100-017F   Latin Extended-A             (as above)
U+0180-024F   Latin Extended-B             Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             These aren't used by almost all present languages
U+A720-A7FF   Latin Extended-D             (as above)

Latin Alphabets in Unicode Codepoint Chart

(Chart legend: for Vietnamese only / used often / used sometimes)

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  – Sampling 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'

(20% of all tweets have no timezone)

How to annotate

Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

language         training  test
ca Catalan       9089      5082
cs Czech         9082      7682
da Danish        7388      5524
de German        44448     10065
en English       44520     10168
es Spanish       44118     10265
fi Finnish       8087      7050
fr French        44339     10098
hu Hungarian     10030     4904
id Indonesian    44722     10181
it Italian       43366     10152
nl Dutch         44682     10007
no Norwegian     10124     8496
pl Polish        16771     10152
pt Portuguese    44215     10208
ro Romanian      10021     5911
sv Swedish       44054     10032
tr Turkish       44703     10308
vi Vietnamese    15030     10488
total            538789    166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size

• The number of tweets per language is heavily skewed
• Level them out by sampling with replacement from each language, up to the size of the largest data set
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

(Chart of tweet share by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others)
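The approximation described above (whole copies plus a remainder sampled without replacement) can be sketched in Python as follows. The function name, the `corpus` layout (language code mapped to a non-empty list of tweets) and the fixed seed are assumptions made for illustration.

```python
import random


def level_out(corpus, seed=0):
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    leveled = {}
    for lang, tweets in corpus.items():
        copies, rest = divmod(target, len(tweets))
        # whole copies of the data, plus `rest` items drawn without replacement
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
leveled = level_out(corpus)
print({lang: len(tweets) for lang, tweets in leveled.items()})
```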

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• Convert to lower case, excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
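A minimal Python sketch of lowercasing everything except the capital 'I' (U+0049). The function name is illustrative, and this is only one way to express the rule, not necessarily how ldig implements it.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), which is left as is."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))
```

Leaving 'I' untouched avoids deciding between i (U+0069) and ı (U+0131) before the language is known.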

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are two kinds of s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on twitter and on Wikipedia
• The two code points have the same design in some fonts
  – Indistinguishable

ș (U+0219)   ş (U+015F)
ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)

(Photo: 's/t with cedilla' is used on an advertisement board in Bucharest)

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – "Rumanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two forms of the Japanese character 'SA': さ さ
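Regarding 's/t with comma' as 's/t with cedilla' amounts to a four-entry character mapping. A minimal Python sketch, with an illustrative function name:

```python
# Map the comma-below letters to the more common cedilla forms.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace s/t with comma below by s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("Încântat de cuno\u0219tin\u021b\u0103"))
```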

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses ی (U+06CC, Farsi yeh) only
  – News uses ي (U+064A, Arabic yeh) only
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1
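One way to normalize representation 2 into representation 1 is Unicode NFC composition, shown in the Python sketch below. Using the standard `unicodedata` module is an assumption of this sketch; the slides do not say whether ldig uses NFC or its own conversion table.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base letters and combining tone marks into precomposed letters."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```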

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which contains "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of the ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URL
  – mention
  – hash tag
  – RT
  – face marks using alphabet like XD, :p
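A minimal Python sketch of this removal step. The regular expressions are illustrative assumptions rather than the patterns ldig actually uses, and face marks are left out because a reliable pattern for them depends on which emoticons should count.

```python
import re

# URLs, @mentions, #hashtags and the retweet marker "RT"
NOISE = re.compile(r"https?://\S+|@\w+|#\w+|\bRT\b")


def strip_twitter_noise(text):
    """Remove twitter-specific tokens and collapse the remaining whitespace."""
    return " ".join(NOISE.sub(" ", text).split())


print(strip_twitter_noise("RT @user check http://t.co/abc #fun now"))
```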

Normalization for twitter-Specific Representation

• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character is repeated three or more times, shrink it to two
  – No language has three or more consecutive identical Latin letters in its orthography
    • In Japanese, by contrast, there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection

Laugh Normalization

• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to double
  – hahahha ⇒ haha
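Both shrinking rules (repeated characters and laughs) can be sketched with two regular expressions in Python. The patterns, including the choice of h, j and k as laugh consonants, are assumptions made for this illustration and are not taken from ldig.

```python
import re

# Three or more identical characters in a row.
REPEAT = re.compile(r"(.)\1{2,}")
# A consonant-vowel laugh syllable followed by one or more repetitions of it.
LAUGH = re.compile(r"([hjk])([aeiou])(?:\1+\2+)+", re.IGNORECASE)


def shrink(text):
    """Shrink character runs to two characters and laughs to two syllables."""
    text = REPEAT.sub(r"\1\1", text)
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("coooollll", "hahahha", "jajajajajajaja", "kekeke"):
    print(sample, "->", shrink(sample))
```

Note that the character rule yields "cooll" rather than "cool"; it only bounds the run length, which is enough to map all lengthened variants of a word onto one form.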

Implementation and Estimation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with " ", line feeds with U+0001
  – Output is "[feature]\t[frequency]"
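The input and output conventions of the extractor can be mimicked in a few lines of Python. This is a sketch of the text conventions only, under the assumption that the output holds one tab-separated feature and frequency per line; the helper names are hypothetical and the sketch does not call maxsubst.

```python
def encode_corpus(lines):
    """Join lines into one text: TAB becomes a space, line feed becomes U+0001."""
    return "\u0001".join(line.replace("\t", " ") for line in lines)


def parse_extractor_output(output):
    """Parse rows of '[feature]<TAB>[frequency]' into (feature, frequency) pairs."""
    pairs = []
    for row in output.split("\n"):
        if not row:
            continue  # skip the empty row after a trailing line feed
        feature, freq = row.rsplit("\t", 1)
        pairs.append((feature, int(freq)))
    return pairs


print(repr(encode_corpus(["a\tb", "c"])))
print(parse_extractor_output("abra\t2\na\t5\n"))
```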

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for one cycle of SGD
  – -e: learning rate of SGD
  – -r: regularizer of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step
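For readers unfamiliar with the cumulative penalty of [Tsuruoka+ 09], the per-weight update can be sketched in Python as below. Here `u` is the total L1 penalty any weight could have received so far and `q` is the penalty this weight has actually received. The function name is illustrative, and how ldig's `-r` and `--wr` options feed into these quantities is not stated in the slides.

```python
def apply_cumulative_penalty(w, q, u):
    """Pull weight w toward zero by the penalty it has not yet received.

    Returns the clipped weight and the updated received penalty.
    """
    z = w
    if w > 0:
        w = max(0.0, w - (u + q))
    elif w < 0:
        w = min(0.0, w + (u - q))
    return w, q + (w - z)


w, q = apply_cumulative_penalty(0.5, 0.0, 0.2)
print(w, q)
w, q = apply_cumulative_penalty(w, q, 0.4)
print(w, q)
```

Because each weight remembers what it has already paid, a weight can be brought up to date lazily when it is next touched, which matters when there are too many parameters to regularize at every step.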

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

Examples:
en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
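A small Python sketch for reading one line of this format. It assumes the fields are tab-separated, that the first field is the label and the last is the text, and that everything in between is metadata; the last point is inferred from the Catalan example rather than stated in the slides.

```python
def parse_line(line):
    """Split a data line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[1:-1], fields[-1]


label, meta, text = parse_line("en\t\tu should just enjoy ur vacation sadly\n")
print(label, meta, text)
```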

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is started
  – Outputs the language probabilities, the contained features and their parameters for a text entered in the text area

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the twitter corpus

As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size.

Precision, recall, LD53 and LDsm are percentages:

language         size    detect  correct  precision  recall  LD53   LDsm
ca Catalan       5093    4923    4857     98.66      95.37   95.3   97.0
cs Czech         7681    7668    7663     99.93      99.77   96.3   99.7
da Danish        5516    5472    5310     97.04      96.27   94.5   92.4
de German        10060   10069   10006    99.37      99.46   86.6   93.8
en English       10162   10133   10029    98.97      98.69   88.3   95.0
es Spanish       10244   10284   10120    98.41      98.79   91.5   96.0
fi Finnish       7051    7038    7024     99.80      99.62   98.9   99.6
fr French        10074   10134   10051    99.18      99.77   95.0   98.1
hu Hungarian     4904    4892    4858     99.30      99.06   85.8   95.5
id Indonesian    10178   10225   10160    99.36      99.82   89.7   98.9
it Italian       10143   10205   10103    99.00      99.61   96.2   98.0
nl Dutch         10005   9916    9858     99.42      98.53   69.5   97.4
no Norwegian     8504    8432    8201     97.26      96.44   96.0   96.3
pl Polish        10151   10149   10130    99.81      99.79   98.0   99.7
pt Portuguese    10212   10201   10119    99.20      99.09   88.0   96.9
ro Romanian      5913    5867    5850     99.71      98.93   92.8   97.4
sv Swedish       10025   10093   9942     98.50      99.17   96.0   97.9
tr Turkish       10308   10317   10298    99.82      99.90   97.6   99.5
vi Vietnamese    10487   10480   10474    99.94      99.88   98.7   99.2

total: size 166711, correct 165053 (99.01%); LD53 92.2, LDsm 97.4
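The precision and recall columns follow from the counts, reading "detect" as the number of texts assigned to a language and "correct" as those assigned correctly. The Python sketch below reproduces the Catalan row and shows the 0.6 threshold; the function names are illustrative assumptions.

```python
def precision_recall(size, detect, correct):
    """Precision = correct / detect, recall = correct / size, both in percent."""
    return 100.0 * correct / detect, 100.0 * correct / size


def decide(probs, threshold=0.6):
    """Return the most probable language, or None when it is below the threshold."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None


p, r = precision_recall(size=5093, detect=4923, correct=4857)  # Catalan row
print(f"{p:.2f} {r:.2f}")
print(decide({"en": 0.55, "nl": 0.45}), decide({"en": 0.9, "nl": 0.1}))
```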

Estimation for LIGA dataset

• Evaluation using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/

(Using the 19-language model)

Precision and recall are percentages:

language      size   detect  correct  precision  recall
de German     1479   1476    1469     99.5       99.3
en English    1505   1502    1490     99.2       99.0
es Spanish    1562   1548    1541     99.6       98.7
fr French     1551   1549    1540     99.4       99.3
it Italian    1539   1531    1528     99.8       99.3
nl Dutch      1430   1429    1424     99.7       99.6

total: size 9066, correct 8992 (99.2%)

Estimation for Europarl Dataset

ldig figures cover only the languages that ldig supports ("-" = not supported). Rates are percentages.

                         ldig             langdetect       CLD
language         size    correct  rate    correct  rate    correct  rate
bg Bulgarian     1000    -        -       988      98.8    991      99.1
cs Czech         1000    1000     100.0   994      99.4    995      99.5
da Danish        1000    976      97.6    968      96.8    932      93.2
de German        1000    999      99.9    998      99.8    1000     100.0
el Greek         1000    -        -       1000     100.0   1000     100.0
en English       1000    999      99.9    996      99.6    1000     100.0
es Spanish       1000    1000     100.0   996      99.6    989      98.9
et Estonian      1000    -        -       996      99.6    998      99.8
fi Finnish       1000    997      99.7    998      99.8    1000     100.0
fr French        1000    999      99.9    999      99.9    992      99.2
hu Hungarian     1000    1000     100.0   999      99.9    999      99.9
it Italian       1000    999      99.9    999      99.9    996      99.6
lt Lithuanian    1000    -        -       997      99.7    999      99.9
lv Latvian       1000    -        -       999      99.9    998      99.8
nl Dutch         1000    1000     100.0   974      97.4    995      99.5
pl Polish        1000    998      99.8    999      99.9    997      99.7
pt Portuguese    1000    995      99.5    996      99.6    989      98.9
ro Romanian      1000    1000     100.0   999      99.9    998      99.8
sk Slovak        1000    -        -       988      98.8    990      99.0
sl Slovene       1000    -        -       976      97.6    963      96.3
sv Swedish       1000    995      99.5    991      99.1    993      99.3
total            21000   13957    99.7    20850    99.3    20814    99.1

Conclusions

• A language detector using the maximal substring model
  – Over 99% accuracy for 19 languages
  – Even langdetect trained on the tweet corpus reaches 97% accuracy
• If the corpus is maintained further, the precision will improve further
  – There are still many mistakes (in particular between da and no)
• If metadata is added to the features, the precision will improve further
  – Open question: how to add and train metadata at low cost
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)

References

• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 68

Page 13: Short Text Language Detection with Infinity-Gram

Estimation with News Text

bull Test for crawled news text from web in 49 languages Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 15

Language size accuracyaf Afrikaans 200 199 (9950)ar Arabic 200 200 (10000)bg Bulgarian 200 200 (10000)bn Bengali 200 200 (10000)cs Czech 200 200 (10000)da Dannish 200 179 (8950)de German 200 200 (10000)el Greek 200 200 (10000)en English 200 200 (10000)es Spanish 200 200 (10000)fa Persian 200 200 (10000)fi Finnish 200 200 (10000)fr French 200 200 (10000)gu Gujarati 200 200 (10000)he Hebrew 200 200 (10000)hi Hindi 200 200 (10000)hr Croatian 200 200 (10000)hu Hungarian 200 200 (10000)id Indonesian 200 200 (10000)it Italian 200 200 (10000)ja Japanese 200 200 (10000)kn Kannada 200 200 (10000)ko Korean 200 200 (10000)mk Macedonian 200 200 (10000)ml Malayalam 200 200 (10000)

Language size accuracymr Marathi 200 200 (10000)ne Nepali 200 200 (10000)nl Dutch 200 200 (10000)no Norwegian 200 199 (9950)pa Punjabi 200 200 (10000)pl Polish 200 200 (10000)pt Portuguese 200 200 (10000)ro Romanian 200 200 (10000)ru Russian 200 200 (10000)sk Slovak 200 200 (10000)so Somali 200 200 (10000)sq Albanian 200 200 (10000)sv Swedish 200 200 (10000)sw Swahili 200 200 (10000)ta Tamil 200 200 (10000)te Telugu 200 200 (10000)th Thai 200 200 (10000)tl Tagalog 200 200 (10000)tr Turkish 200 200 (10000)uk Ukrainian 200 200 (10000)ur Urdu 200 200 (10000)vi Vietnamese 200 200 (10000)

zh-cn Simplified Chinese 200 200 (10000)zh-tw Traditional Chinese 200 200 (10000)

total 9800 9777 (9977)

Estimation with Europarl datasets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 16

bull Test for 1000 samples for each

language from Europarl Parallel Corpus

ndash from the proceedings of the European Parliament

ndash httpwwwstatmtorgeuroparl

bull httpcodegooglecomplanguage-

detectiondownloadsdetailname=eur

oparl-testzip

language size correct accuracybg Bulgarian 1000 988 988cs Czech 1000 994 994da Dannish 1000 968 968de German 1000 998 998el Greek 1000 1000 1000en English 1000 996 996es Spanish 1000 996 996et Estonian 1000 996 996fi Finnish 1000 998 998fr French 1000 999 999hu Hungarian 1000 999 999it Italian 1000 999 999lt Lithuanian 1000 997 997lv Latvian 1000 999 999nl Dutch 1000 974 974pl Polish 1000 999 999pt Portuguese 1000 996 996ro Romanian 1000 999 999sk Slovak 1000 988 988sl Slovene 1000 976 976sv Swedish 1000 991 991

total 21000 20850 993

Language Detection has been over isnt it

17

We still have ENEMY to beat

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 18

Twitter Language Detection with the Existing Methods

bull Only 90-95 accuracy

for tweet corpus

bull LD = language-detection

bull CLD = Chromium Compact Language

Detection

ndash httpcodegooglecompchromium-

compact-language-detector

ndash regard ms(Malay) as id(Indonesian)

bull Tika = Apache Tika

ndash httptikaapacheorg

ndash Estimate on 15 languages which Tika

supports in our tweet corpus

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 19

language LD CLD Tikaca Catalan 953 930 838cs Czech 963 966 ----da Dannish 945 907 587de German 866 968 731en English 883 974 547es Spanish 915 905 444fi Finnish 989 994 948fr French 950 945 674hu Hungarian 858 890 762id Indonesian 897 928 ----it Italian 962 938 871nl Dutch 695 932 650no Norwegian 960 749 686pl Polish 980 978 888pt Portuguese 880 886 474ro Romanian 928 961 826sv Swedish 960 964 756tr Turkish 976 974 ----vi Vietnamese 987 989 ----

total 922 938 700

Chromium Compact Language Detection (CLD)

bull Porting the language detector from

Google Chromium ndash httpcodegooglecompchromium-compact-language-detector

ndash Implementation in C++ Python binding

ndash of supported languages CLD = 76

langdetect = 53

ndash Accuracy CLD = 9882 langdetect =

9922

bull for 17 languages on Europarl datasets bull httpblogmikemccandlesscom201110accuracy-and-performance-of-googleshtml

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 20

Is twitter Language Detection difficult (1)

bull Tweet is too short to extract 3-gram features

ndash At most 140 characters on twitter

ndash URLs mentions and hashtags are not useful to

detect

bull LIGA [Tromp+ 11]

ndash Graph-features based on 3-gram

bull Add long distance features

bull 95~98 accuracy for twitter Language Detection

bull 6 languages (de en es fr it nl)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 21

Is twitter Language Detection difficult (2)

bull Tweet is too noisy

ndash Representations against the languages orthography often appear

ndash Acronym Abbreviation lengthened word (like Cooooolll)

bull Likelihood of tweet tends to get smaller on normal language model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

OMG Oh My God

LOL Laughing Out Loud

LMAO Laughing My Ass Out

F4F Follow for Follow

MDR Mort de Rire (French)

TKT Ne tlsquoInquiegravete Pas (Fr)

u you

ur your

4 for

i0u I love you

k che (Italian)

anke anche(Italian)

Letter k isnt used in Italian

22

Motivation to Detect Short Text Language

bull There are many small chunks of text in addition to twitter

ndash Schedule search query bulletin board and so on

ndash There are many questions about short text detection in the Issues Board of langdetect Project

bull httpcodegooglecomplanguage-detectionissuesdetailid=10

bull Detection for multi-language mixed text

ndash Cut the target document in paragraphs or lines

ndash Detect for each short text

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 23

Our Goal

bull Over 99 accuracy

ndash However it is too difficult to detect one

word sentence

ndash Our Goal is 99+ accurate detection for

sentence with more than 3 words

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 24

We need

• A model that can extract rich features from short text
  - Maximal substring model (∞-gram logistic regression)
• and a Twitter-specific language model, or a corpus to construct it
  - A corpus of about 700K tweets with language labels

Proposed Method

How to Increase Features from 3-grams?

• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  - But the number of features is of order O(T^2)

      number of distinct n-grams (cumulative)
 n    freq≥1    freq≥2    freq≥10
 1        79        72         57
 2      1896      1533        902
 3     15970     10369       4525
 4     64966     33941      10534
 5    167543     69719      15538
 6    323749    107861      18970
 7    524634    142954      21093
 8    760719    171995      22159
 9    921361    193995      22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
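A count like the one in the table above can be reproduced on a toy scale with a brute-force sketch such as the following. The function name and the sample text are assumptions made for illustration; this is not the code used to produce the slide's figures, and it is far too slow for a real corpus.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """Map n to the number of distinct substrings of length <= n
    that occur at least min_freq times in texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return {
        n: sum(1 for gram, c in counts.items()
               if len(gram) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    }


print(cumulative_ngram_counts(["abracadabra"], 3))
print(cumulative_ngram_counts(["abracadabra"], 3, min_freq=2))
```

Even on this single word the count grows quickly with n, while the frequency cut-off keeps it small, which is the same trend the table shows.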

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  - Maximal substrings give an equivalent model that can be constructed in linear time
  - Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (a semi-order) among the non-empty substrings of a text, e.g. abracadabra
  - "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  - "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  - Strictly speaking, the relation is defined together with the position inside the containing substring

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of abracadabra are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
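The definition can be checked on the abracadabra example with a small brute-force sketch. It relies on one reading of the definition: a substring is maximal when every one-character extension to the left or right occurs strictly less often. The function names are hypothetical, and this is not the linear-time suffix-array method that esaxx implements.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1)
               if text.startswith(sub, i))


def maximal_substrings(text):
    """Return the substrings that cannot be extended by one character
    on either side without losing occurrences."""
    alphabet = set(text)
    result = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            sub = text[i:j]
            if sub in result:
                continue
            freq = count_occurrences(text, sub)
            extendable = any(
                count_occurrences(text, c + sub) == freq
                or count_occurrences(text, sub + c) == freq
                for c in alphabet
            )
            if not extendable:
                result.add(sub)
    return result


print(sorted(maximal_substrings("abracadabra"), key=len))
```

For example, "bra" is rejected because "abra" occurs exactly as often, while "abra" is kept because "dabra" and "abrac" each occur only once.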

Maximal Substring and Infinity-Gram

• Substrings that have a containment relation always have equal frequencies
• In a model that is a linear combination of features, features with common values can be merged into one
• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams

Although the equivalence breaks down on the test set, we assume that it can be approximated with a sufficiently large training set
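To make the "linear combination of features" concrete, here is a minimal sketch of multiclass logistic regression prediction over substring features. The weights, labels and function name are invented for illustration, and counting one hit per occurrence is an assumption; the trained ldig model and its trie lookup are not reproduced here.

```python
import math


def predict(text, weights, labels):
    """Sum the per-label weights of every known substring of text,
    then turn the scores into probabilities with a softmax."""
    scores = [0.0] * len(labels)
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            w = weights.get(text[i:j])
            if w is not None:
                for k, value in enumerate(w):
                    scores[k] += value
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


labels = ["de", "en"]
weights = {"ich": [2.0, 0.0], "the": [0.0, 2.0]}  # hypothetical weights
probs = predict("ich bin", weights, labels)
print(probs)
```

Because the score is linear, two substrings that always occur together could share a single weight without changing any prediction, which is why keeping only maximal substrings loses nothing on the training set.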

Extended Suffix Array

• An Extended Suffix Array consists of
  - SA = Suffix Array
  - L = Longest Common Prefixes
  - B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix range with L>0 whose BWT has more than 1 character type
  - They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  - http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character types to detect
  - In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  - The most difficult alphabet type to detect
  - There are more than 25 languages with over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  - å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for composing tone marks
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day language
U+A720-A7FF    Latin Extended-D              (same as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code point chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the Twitter Streaming API
  - It samples 1% of all tweets (about 2 million tweets)
  - Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  - French tweets account for only about 1% of all tweets
  - But they account for 50% of the tweets in the Paris timezone
  - (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
  - Remove non-French tweets from those labeled 'fr'
  - Recover French tweets from those not labeled 'fr'

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

language         training      test
ca Catalan           9089      5082
cs Czech             9082      7682
da Danish            7388      5524
de German           44448     10065
en English          44520     10168
es Spanish          44118     10265
fi Finnish           8087      7050
fr French           44339     10098
hu Hungarian        10030      4904
id Indonesian       44722     10181
it Italian          43366     10152
nl Dutch            44682     10007
no Norwegian        10124      8496
pl Polish           16771     10152
pt Portuguese       44215     10208
ro Romanian         10021      5911
sv Swedish          44054     10032
tr Turkish          44703     10308
vi Vietnamese       15030     10488
total              538789    166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the Twitter corpus
  - It still reaches at most 98% accuracy
• We guess it is necessary to reduce biases
  - data size bias
  - language-specific bias
  - Twitter-specific bias

Bias by Data Size

• The number of tweets per language is heavily biased
• Level them out by sampling with replacement from each language up to the size of the largest one
  - In practice this is approximated by copying the data an integer number of times and sampling the rest without replacement

[Pie chart: share of tweets per language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
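The leveling step can be sketched as follows, using the approximation described on the slide: copy each language's data a whole number of times, then draw the remainder without replacement. The function name, the fixed seed and the toy corpus are assumptions; every language is assumed to have at least one text.

```python
import random


def level_out(corpus, seed=0):
    """corpus maps a language label to a non-empty list of texts.
    Returns a corpus in which every language has as many texts
    as the largest language."""
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus.values())
    leveled = {}
    for label, texts in corpus.items():
        copies, rest = divmod(target, len(texts))
        # whole copies first, then a partial sample without replacement
        leveled[label] = texts * copies + rng.sample(texts, rest)
    return leveled


corpus = {
    "en": ["e%d" % i for i in range(10)],
    "da": ["d0", "d1", "d2"],
}
leveled = level_out(corpus)
print({label: len(texts) for label, texts in leveled.items()})
```

Compared with plain sampling with replacement, this keeps every original text in the result and bounds how unevenly the texts of a small language are repeated.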

Convert to Lowercase on Multiple Languages

• Conversion to lower case saves corpus size and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

                        Upper case    Lower case
Turkish, Azerbaijani    I (U+0049)    ı (U+0131)
                        İ (U+0130)    i (U+0069)
Others                  I (U+0049)    i (U+0069)
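A minimal sketch of this rule is shown below. Python's built-in `str.lower()` always maps I (U+0049) to i, whatever the language, so the sketch simply leaves that one letter untouched; the function name is an assumption, and the handling of İ (U+0130) is not covered by the slide and is left out.

```python
def lower_except_i(text):
    """Lowercase every character except the capital letter I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_i("ISTANBUL Is OK"))
```

Keeping I as a separate symbol means the model never has to decide whether it stands for i or ı before the language is known.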

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s and t with a "beard"
  - U+015E-F, U+0162-3: s/t with cedilla
  - U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia
• The 2 kinds have the same design in some fonts
  - Indistinguishable

  ș (U+0219)    ş (U+015F)
  ț (U+021B)    ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography prescribes 's/t with comma', these letters were not available on PCs until recently
  - 1989: Democratization of Romania
  - 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  - 2007: Romania joined the EU
  - 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  - But they are more popular than the proper ones
  - with cedilla : with comma = 2 : 1
  - The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

  ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ さ)
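The substitution can be sketched with a translation table that maps the four comma-below code points to their cedilla counterparts. The names are assumptions chosen for illustration; the code points are the ones listed on the slides.

```python
# comma-below letters -> cedilla letters (upper and lower case)
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Regard 's/t with comma below' as 's/t with cedilla'."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("cuno\u0219tin\u021b\u0103"))
```

Mapping toward the cedilla forms follows the slide's choice of the more frequent variant, so that both spellings share the same features.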

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  - Wikipedia uses only ی (U+06CC, Farsi yeh)
  - News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  - The popular Arabic charset CP-1256 has no character mapped to U+06CC
  - As 'yeh' is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  - a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  - a ả à ã á ạ
  - These tone marks are also used in general documents like news
• The tone marks can be attached to all vowels
  - 12 × 6 = 72

Normalization for Vietnamese (2)

• Representations of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  - About half and half in news and tweets
• Normalize 2 into 1
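One way to normalize representation 2 into representation 1 is Unicode canonical composition (NFC), sketched below with Python's standard `unicodedata` module. Using NFC is an assumption: the slides do not say how the author implemented the step, and NFC also composes characters outside the Vietnamese range.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining tone marks into
    precomposed characters (Unicode NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = normalize_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

After this step both spellings of ẵ become the single code point U+1EB5, so they produce the same substring features.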

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
  - Other character types have only 30-50 characters
• The character space is very sparse
  - Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  - Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  - (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  - (2) "Commonly used Kanji" lists provided in Japan and China
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      - 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  - URLs
  - mentions
  - hashtags
  - RT
  - face marks made of letters, like XD and :p
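The removal step can be sketched with a few regular expressions. The patterns below are assumptions written for illustration (in particular the face-mark pattern only covers shapes like XD and :p); they are not the expressions used in ldig.

```python
import re

PATTERNS = [
    re.compile(r"https?://\S+"),  # URLs
    re.compile(r"@\w+"),          # mentions
    re.compile(r"#\w+"),          # hashtags
    re.compile(r"\bRT\b"),        # retweet marker
    # face marks made of letters, e.g. XD, xDDD, :p
    re.compile(r"(?<!\w)(?:[xX]+[dD]+|[:;]-?[pPdD])(?!\w)"),
]


def clean_tweet(text):
    """Remove Twitter-specific tokens and collapse whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(clean_tweet("RT @bob check http://t.co/abc #cool so funny XD"))
```

URLs are removed first so that a `#` or `@` inside a URL is not mistaken for a hashtag or a mention.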

Normalization for twitter-Specific Representation

• How should words like 'coooooooollllll' be handled?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  - Unsupervised normalization like coooollll → cool
  - It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink the run to 2
  - There is no language whose orthography has 3 or more consecutive occurrences of the same Latin letter
    • In Japanese there are "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
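Case 2 fits in a single regular expression, sketched below. Reading "more than 3" as "3 or more" is an assumption based on the three-character examples given on the slide; the function name is hypothetical.

```python
import re

REPEATED = re.compile(r"(.)\1{2,}")  # the same character 3 or more times


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to 2."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Unlike the dictionary approach of Case 1, this does not recover the original word, but it maps every lengthened variant of a word to the same string without needing any word list.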

Laugh Normalization

• Laughs are written in various ways in each language
  - HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  - Hihihihi ) Habe ich regulär 2x die Woche
  - Tafil con eso Jajajajajajaja
  - Malo Jejejeje XP
  - kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  - hahahha ⇒ haha
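A simple version of laugh shrinking can be sketched as a regular expression over repeated two-character units. This is an assumption about how the rule might be written: it expects text that is already lowercased, it only handles cleanly periodic laughs such as "jajaja", and irregular ones like the slide's "hahahha" would need extra handling.

```python
import re

LAUGH = re.compile(r"(\w\w)\1{2,}")  # a 2-character unit, 3 or more times


def shrink_laughs(text):
    """Shrink a two-character unit repeated 3 or more times to 2 repeats."""
    return LAUGH.sub(r"\1\1", text)


print(shrink_laughs("tafil con eso jajajajajajaja"))
print(shrink_laughs("kekeke"))
```

The syllable itself (ha, ja, je, ke, hi) is kept, since it is a useful hint about the language; only the length of the laugh is discarded.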

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  - https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  - ∞-gram LR (maximal substring) [Okanohara+ 09]
  - L1 SGD (cumulative penalty) [Tsuruoka+ 09]
  - Double array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  - Extract features from the corpus and initialize the model
  - -m: model directory
  - -x: path of the maximal substring extractor (executed as an external process)
  - --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  - Input is multi-line text
    • Replace TABs with " " and line feeds with U+0001 in it
  - Output is "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  - Train the model on the corpus for 1 cycle of SGD
  - -e: learning rate of SGD
  - -r: coefficient of L1 regularization
  - --wr: number of times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  - Remove useless features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  - Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  - [correct label]\t[metadata]\t[text]

en	u should just enjoy ur vacation sadly
en	D im online but you arent RT that much
en	im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca	[status ID]	[datetime]	[userID]	[language of UI]	xxx	xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
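Reading one line of this format can be sketched as below. The sketch assumes that the label is the first tab-separated field, the text is the last one, and anything in between is metadata, which is how the examples above appear to be laid out; the function name and the sample values are hypothetical.

```python
def parse_line(line):
    """Split '[label]\\t[metadata...]\\t[text]' into its parts.
    Metadata may consist of zero or more tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text


print(parse_line("ca\t123\t2012-05-14\tno m'extranya\n"))
print(parse_line("en\tu should just enjoy ur vacation"))
```

Taking the text from the last field lets lines with and without metadata go through the same code path.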

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]
  - Open http://localhost:[port]/ after it is started
  - For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Evaluation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

language           size   detect  correct  precision  recall   LD53   LDsm
ca Catalan         5093     4923     4857      98.66   95.37   95.3   97.0
cs Czech           7681     7668     7663      99.93   99.77   96.3   99.7
da Danish          5516     5472     5310      97.04   96.27   94.5   92.4
de German         10060    10069    10006      99.37   99.46   86.6   93.8
en English        10162    10133    10029      98.97   98.69   88.3   95.0
es Spanish        10244    10284    10120      98.41   98.79   91.5   96.0
fi Finnish         7051     7038     7024      99.80   99.62   98.9   99.6
fr French         10074    10134    10051      99.18   99.77   95.0   98.1
hu Hungarian       4904     4892     4858      99.30   99.06   85.8   95.5
id Indonesian     10178    10225    10160      99.36   99.82   89.7   98.9
it Italian        10143    10205    10103      99.00   99.61   96.2   98.0
nl Dutch          10005     9916     9858      99.42   98.53   69.5   97.4
no Norwegian       8504     8432     8201      97.26   96.44   96.0   96.3
pl Polish         10151    10149    10130      99.81   99.79   98.0   99.7
pt Portuguese     10212    10201    10119      99.20   99.09   88.0   96.9
ro Romanian        5913     5867     5850      99.71   98.93   92.8   97.4
sv Swedish        10025    10093     9942      98.50   99.17   96.0   97.9
tr Turkish        10308    10317    10298      99.82   99.90   97.6   99.5
vi Vietnamese     10487    10480    10474      99.94   99.88   98.7   99.2
total            166711            165053              99.01   92.2   97.4
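The columns of this table can be computed as in the following sketch: a text counts as detected only when its best probability reaches the 0.6 threshold, precision is correct / detect and recall is correct / size. The function name and the tiny sample are assumptions for illustration, not the evaluation code of ldig.

```python
from collections import Counter


def evaluate(samples, threshold=0.6):
    """samples: iterable of (gold_label, {label: probability}).
    Returns per-language size, detect, correct, precision and recall."""
    size, detect, correct = Counter(), Counter(), Counter()
    for gold, probs in samples:
        size[gold] += 1
        best, p = max(probs.items(), key=lambda kv: kv[1])
        if p < threshold:
            continue  # treated as undetectable
        detect[best] += 1
        if best == gold:
            correct[gold] += 1
    return {
        lang: {
            "size": size[lang],
            "detect": detect[lang],
            "correct": correct[lang],
            "precision": correct[lang] / detect[lang] if detect[lang] else 0.0,
            "recall": correct[lang] / size[lang],
        }
        for lang in size
    }


samples = [
    ("en", {"en": 0.9, "de": 0.1}),
    ("en", {"en": 0.5, "de": 0.5}),  # below the threshold: undetectable
    ("de", {"en": 0.7, "de": 0.3}),  # detected as en, but wrong
    ("de", {"de": 0.8, "en": 0.2}),
]
report = evaluate(samples)
print(report)
```

As a check against the table, the Catalan row gives 4857 / 4923 ≈ 98.66% precision and 4857 / 5093 ≈ 95.37% recall.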

Evaluation on the LIGA Dataset

• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  - http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

language       size   detect  correct  precision  recall
de German      1479     1476     1469       99.5    99.3
en English     1505     1502     1490       99.2    99.0
es Spanish     1562     1548     1541       99.6    98.7
fr French      1551     1549     1540       99.4    99.3
it Italian     1539     1531     1528       99.8    99.3
nl Dutch       1430     1429     1424       99.7    99.6
total          9066              8992               99.2

Evaluation on the Europarl Dataset

(ldig figures cover only the languages that ldig supports)

                          ldig            langdetect          CLD
language        size   correct  rate    correct  rate    correct  rate
bg Bulgarian    1000       -      -        988   98.8       991   99.1
cs Czech        1000     1000  100.0       994   99.4       995   99.5
da Danish       1000      976   97.6       968   96.8       932   93.2
de German       1000      999   99.9       998   99.8      1000  100.0
el Greek        1000       -      -       1000  100.0      1000  100.0
en English      1000      999   99.9       996   99.6      1000  100.0
es Spanish      1000     1000  100.0       996   99.6       989   98.9
et Estonian     1000       -      -        996   99.6       998   99.8
fi Finnish      1000      997   99.7       998   99.8      1000  100.0
fr French       1000      999   99.9       999   99.9       992   99.2
hu Hungarian    1000     1000  100.0       999   99.9       999   99.9
it Italian      1000      999   99.9       999   99.9       996   99.6
lt Lithuanian   1000       -      -        997   99.7       999   99.9
lv Latvian      1000       -      -        999   99.9       998   99.8
nl Dutch        1000     1000  100.0       974   97.4       995   99.5
pl Polish       1000      998   99.8       999   99.9       997   99.7
pt Portuguese   1000      995   99.5       996   99.6       989   98.9
ro Romanian     1000     1000  100.0       999   99.9       998   99.8
sk Slovak       1000       -      -        988   98.8       990   99.0
sl Slovene      1000       -      -        976   97.6       963   96.3
sv Swedish      1000      995   99.5       991   99.1       993   99.3
total          21000    13957   99.7     20850   99.3     20814   99.1

Conclusions

• A language detector using the maximal substring model
  - Detects 19 languages with over 99% accuracy
  - Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is cleaned up, the precision will rise further
  - There are still many mistakes (in particular for da and no)
• If metadata is added to the features, the precision will rise further
  - How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  - It is too large for applications (>100MB)

References

• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 14: Short Text Language Detection with Infinity-Gram

Evaluation with Europarl Datasets

• Test on 1000 samples for each language from the Europarl Parallel Corpus
  - from the proceedings of the European Parliament
  - http://www.statmt.org/europarl/
• http://code.google.com/p/language-detection/downloads/detail?name=europarl-test.zip

language        size  correct  accuracy
bg Bulgarian    1000      988      98.8
cs Czech        1000      994      99.4
da Danish       1000      968      96.8
de German       1000      998      99.8
el Greek        1000     1000     100.0
en English      1000      996      99.6
es Spanish      1000      996      99.6
et Estonian     1000      996      99.6
fi Finnish      1000      998      99.8
fr French       1000      999      99.9
hu Hungarian    1000      999      99.9
it Italian      1000      999      99.9
lt Lithuanian   1000      997      99.7
lv Latvian      1000      999      99.9
nl Dutch        1000      974      97.4
pl Polish       1000      999      99.9
pt Portuguese   1000      996      99.6
ro Romanian     1000      999      99.9
sk Slovak       1000      988      98.8
sl Slovene      1000      976      97.6
sv Swedish      1000      991      99.1
total          21000    20850      99.3

Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  - http://code.google.com/p/chromium-compact-language-detector/
  - ms (Malay) is regarded as id (Indonesian)
• Tika = Apache Tika
  - http://tika.apache.org/
  - Evaluated on the 15 languages in our tweet corpus that Tika supports

language           LD    CLD   Tika
ca Catalan       95.3   93.0   83.8
cs Czech         96.3   96.6   ----
da Danish        94.5   90.7   58.7
de German        86.6   96.8   73.1
en English       88.3   97.4   54.7
es Spanish       91.5   90.5   44.4
fi Finnish       98.9   99.4   94.8
fr French        95.0   94.5   67.4
hu Hungarian     85.8   89.0   76.2
id Indonesian    89.7   92.8   ----
it Italian       96.2   93.8   87.1
nl Dutch         69.5   93.2   65.0
no Norwegian     96.0   74.9   68.6
pl Polish        98.0   97.8   88.8
pt Portuguese    88.0   88.6   47.4
ro Romanian      92.8   96.1   82.6
sv Swedish       96.0   96.4   75.6
tr Turkish       97.6   97.4   ----
vi Vietnamese    98.7   98.9   ----
total            92.2   93.8   70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  - http://code.google.com/p/chromium-compact-language-detector/
  - Implemented in C++, with a Python binding
  - Number of supported languages: CLD = 76, langdetect = 53
  - Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 20

Is twitter Language Detection difficult (1)

bull Tweet is too short to extract 3-gram features

ndash At most 140 characters on twitter

ndash URLs mentions and hashtags are not useful to

detect

bull LIGA [Tromp+ 11]

ndash Graph-features based on 3-gram

bull Add long distance features

bull 95~98 accuracy for twitter Language Detection

bull 6 languages (de en es fr it nl)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 21

Is twitter Language Detection difficult (2)

bull Tweet is too noisy

ndash Representations against the languages orthography often appear

ndash Acronym Abbreviation lengthened word (like Cooooolll)

bull Likelihood of tweet tends to get smaller on normal language model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

OMG Oh My God

LOL Laughing Out Loud

LMAO Laughing My Ass Out

F4F Follow for Follow

MDR Mort de Rire (French)

TKT Ne tlsquoInquiegravete Pas (Fr)

u you

ur your

4 for

i0u I love you

k che (Italian)

anke anche(Italian)

Letter k isnt used in Italian

22

Motivation to Detect Short Text Language

bull There are many small chunks of text in addition to twitter

ndash Schedule search query bulletin board and so on

ndash There are many questions about short text detection in the Issues Board of langdetect Project

bull httpcodegooglecomplanguage-detectionissuesdetailid=10

bull Detection for multi-language mixed text

ndash Cut the target document in paragraphs or lines

ndash Detect for each short text

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 23

Our Goal

bull Over 99 accuracy

ndash However it is too difficult to detect one

word sentence

ndash Our Goal is 99+ accurate detection for

sentence with more than 3 words

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 24

We need

bull Rich feature extractable model from

short text

ndash Maximal substring model

(infin-gram Logistic Regression)

bull and twitter-specific Language model

or Corpus to construct it

ndash about 700K tweet corpus with language

label

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 25

Proposal Method

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 26

How to increase features from 3-grams

bull The more n the more features

bull Maximum at n=infin that is all substring

ndash But it has O(T2) order

gram of n-gram

freq≧1 freq≧2 freq≧10

1 79 72 57

2 1896 1533 902

3 15970 10369 4525

4 64966 33941 10534

5 167543 69719 15538

6 323749 107861 18970

7 524634 142954 21093

8 760719 171995 22159

9 921361 193995 22696

cumulative distributuion of feature length for 5090 normalized English tweets (300KB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 27

Text Categorization with All Substring Features [Okanohara+ 09]

bull Multiclass Logistic Regression using all

substrings as features

ndash Maximal Substring makes the equivalent

model that can be constructed in linear

time

ndash Store features into TRIE fast prediction

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 28

Maximal Substring (1)

bull Define a containment(semi-order)

among non empty substrings

abracadabra

ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur

as the substring of ldquobrardquo

ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo

but also ldquocardquo It is strictly defined with also its position in the substring

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 29

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)

[Photo: 's/t with cedilla' is used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA' (さ)
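The substitution "regard 's/t with comma' as 's/t with cedilla'" can be written as a small translation table. This is a sketch under the assumption that all four comma-below code points (upper and lower case) are mapped; the names below are illustrative and not taken from ldig.

```python
# Map comma-below s/t (U+0218-021B) to the cedilla forms (U+015E-F, U+0162-3).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with the more common cedilla variants."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219tiin\u021b\u0103"))
```

After this step both spellings of the same Romanian word produce identical substring features.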

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses ی (U+06CC, Farsi yeh) only
  – News uses ي (U+064A, Arabic yeh) only
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1
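One way to normalize representation 2 into representation 1 is Unicode canonical composition (NFC) from the Python standard library. This is a stand-in chosen for illustration; the slides do not say which mechanism ldig uses, and the function name is hypothetical.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base vowel + combining tone mark into one precomposed letter."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"          # ă (U+0103) followed by combining tilde (U+0303)
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

Both spellings of ẵ then map to the single code point U+1EB5, so news text and tweets share the same features.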

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has "的" tends to be detected as Traditional Chinese
    • Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ascii alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      – 常用漢字 doesn't contain many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD, :p

Normalization for twitter-Specific Representation

• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language has a run of 3 or more of the same Latin letter in its orthography
    • Japanese does have such runs: "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection

Laugh Normalization

• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
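The twitter-specific normalization steps on the last three slides can be sketched with regular expressions. All patterns and names below are assumptions made for this example rather than ldig's actual rules; in particular the laugh rule only handles regular repetitions of a two-letter syllable, so an irregular laugh like "hahahha" is left unchanged.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as xD, xDDD, :p, :D (a deliberately small set)
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[DPp()])(?!\w)")
# The same letter 3 or more times in a row
LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}")
# The same two-letter syllable 3 or more times in a row (haha..., jaja..., keke...)
SYLLABLE_RUN = re.compile(r"([^\W\d_]{2})\1{2,}")


def normalize_tweet(text):
    """Remove twitter-specific tokens and shrink lengthened words and laughs."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    text = LETTER_RUN.sub(r"\1\1", text)      # coooolll -> cooll
    text = SYLLABLE_RUN.sub(r"\1\1", text)    # hahahaha -> haha
    return " ".join(text.split())


print(normalize_tweet("RT @user coooooollll http://t.co/x #tag hahahaha"))
```

Shrinking runs to two characters (Case 2 on the slide) needs no dictionary, at the price of not restoring the original spelling: "coooooollll" becomes "cooll", not "cool".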

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor

• maxsubst [input file] [output file]
  – Input: multi-line text
    • TABs are replaced with " " and line feeds with U+0001 in it
  – Output: "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learn the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how many times to regularize all parameters
    • There are too many parameters to regularize all of them at every step
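The idea behind L1 SGD with cumulative penalty [Tsuruoka+ 09] can be sketched for a binary logistic regression with binary features. This is a simplified illustration: ldig is multiclass and its implementation may differ, and the names train_l1_sgd, eta and c are made up for the example (c plays the role of the regularization coefficient divided by the number of examples).

```python
import math
from collections import defaultdict


def train_l1_sgd(data, eta=0.1, c=0.01, epochs=20):
    """data: list of (features, label); features is a list of strings, label is 0 or 1."""
    w = defaultdict(float)  # feature weights
    q = defaultdict(float)  # total L1 penalty actually applied to each weight
    u = 0.0                 # total L1 penalty each weight could have received
    for _ in range(epochs):
        for feats, y in data:
            score = sum(w[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-score))
            u += eta * c
            for f in feats:
                w[f] += eta * (y - p)   # gradient step of the log-likelihood
                z = w[f]
                # Apply the penalty this weight has "missed" so far, clipping at zero.
                if w[f] > 0:
                    w[f] = max(0.0, w[f] - (u + q[f]))
                elif w[f] < 0:
                    w[f] = min(0.0, w[f] + (u - q[f]))
                q[f] += w[f] - z
    return w


data = [(["ik", "en"], 1), (["ich", "und"], 0), (["ik", "het"], 1), (["ich", "der"], 0)]
weights = train_l1_sgd(data)
print(sorted(weights.items()))
```

Only the weights of features present in the current example are touched, and each one catches up on its accumulated penalty lazily. This is why regularizing all parameters at every step is unnecessary even when the feature set is huge.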

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

Examples:
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
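Reading one record of this tab-separated format might look like the sketch below. The function name is hypothetical, and the assumption that the label is always the first field and the text always the last one (with any number of metadata fields in between) is inferred from the examples, not documented behavior. The metadata values in the sample line are invented placeholders.

```python
def parse_record(line):
    """Split '[correct label]\\t[meta data]...\\t[text]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # empty when the line has only label and text
    return label, metadata, text


sample = "ca\t123\t2012-05-14\t42\ten\txxx xDDD no mextranya"
print(parse_record(sample))
print(parse_record("en\tu should just enjoy ur vacation"))
```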

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is executed
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the twitter corpus
As a text whose maximum probability is below 0.6 is treated as undetectable, the sum of detect is less than the sum of size

language         size     detect   correct  precision  recall   LD53   LDsm
ca Catalan       5093     4923     4857     98.66      95.37    95.3   97.0
cs Czech         7681     7668     7663     99.93      99.77    96.3   99.7
da Danish        5516     5472     5310     97.04      96.27    94.5   92.4
de German        10060    10069    10006    99.37      99.46    86.6   93.8
en English       10162    10133    10029    98.97      98.69    88.3   95.0
es Spanish       10244    10284    10120    98.41      98.79    91.5   96.0
fi Finnish       7051     7038     7024     99.80      99.62    98.9   99.6
fr French        10074    10134    10051    99.18      99.77    95.0   98.1
hu Hungarian     4904     4892     4858     99.30      99.06    85.8   95.5
id Indonesian    10178    10225    10160    99.36      99.82    89.7   98.9
it Italian       10143    10205    10103    99.00      99.61    96.2   98.0
nl Dutch         10005    9916     9858     99.42      98.53    69.5   97.4
no Norwegian     8504     8432     8201     97.26      96.44    96.0   96.3
pl Polish        10151    10149    10130    99.81      99.79    98.0   99.7
pt Portuguese    10212    10201    10119    99.20      99.09    88.0   96.9
ro Romanian      5913     5867     5850     99.71      98.93    92.8   97.4
sv Swedish       10025    10093    9942     98.50      99.17    96.0   97.9
tr Turkish       10308    10317    10298    99.82      99.90    97.6   99.5
vi Vietnamese    10487    10480    10474    99.94      99.88    98.7   99.2
total            166711            165053              99.01    92.2   97.4

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
• Uses the 19-language model

language      size    detect  correct  precision  recall
de German     1479    1476    1469     99.5       99.3
en English    1505    1502    1490     99.2       99.0
es Spanish    1562    1548    1541     99.6       98.7
fr French     1551    1549    1540     99.4       99.3
it Italian    1539    1531    1528     99.8       99.3
nl Dutch      1430    1429    1424     99.7       99.6
total         9066            8992                99.2

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports ("-" marks an unsupported language)

                         ldig            langdetect      CLD
language         size    correct  rate   correct  rate   correct  rate
bg Bulgarian     1000    -        -      988      98.8   991      99.1
cs Czech         1000    1000     100.0  994      99.4   995      99.5
da Danish        1000    976      97.6   968      96.8   932      93.2
de German        1000    999      99.9   998      99.8   1000     100.0
el Greek         1000    -        -      1000     100.0  1000     100.0
en English       1000    999      99.9   996      99.6   1000     100.0
es Spanish       1000    1000     100.0  996      99.6   989      98.9
et Estonian      1000    -        -      996      99.6   998      99.8
fi Finnish       1000    997      99.7   998      99.8   1000     100.0
fr French        1000    999      99.9   999      99.9   992      99.2
hu Hungarian     1000    1000     100.0  999      99.9   999      99.9
it Italian       1000    999      99.9   999      99.9   996      99.6
lt Lithuanian    1000    -        -      997      99.7   999      99.9
lv Latvian       1000    -        -      999      99.9   998      99.8
nl Dutch         1000    1000     100.0  974      97.4   995      99.5
pl Polish        1000    998      99.8   999      99.9   997      99.7
pt Portuguese    1000    995      99.5   996      99.6   989      98.9
ro Romanian      1000    1000     100.0  999      99.9   998      99.8
sk Slovak        1000    -        -      988      98.8   990      99.0
sl Slovene       1000    -        -      976      97.6   963      96.3
sv Swedish       1000    995      99.5   991      99.1   993      99.3
total            21000   13957    99.7   20850    99.3   20814    99.1

Conclusions

• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (more than 100MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


Language detection is a solved problem, isn't it?

We still have an ENEMY to beat

Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus
• LD = language-detection
• CLD = Chromium Compact Language Detection
  – http://code.google.com/p/chromium-compact-language-detector
  – regards ms (Malay) as id (Indonesian)
• Tika = Apache Tika
  – http://tika.apache.org
  – Evaluated on the 15 languages in our tweet corpus that Tika supports

language         LD     CLD    Tika
ca Catalan       95.3   93.0   83.8
cs Czech         96.3   96.6   ----
da Danish        94.5   90.7   58.7
de German        86.6   96.8   73.1
en English       88.3   97.4   54.7
es Spanish       91.5   90.5   44.4
fi Finnish       98.9   99.4   94.8
fr French        95.0   94.5   67.4
hu Hungarian     85.8   89.0   76.2
id Indonesian    89.7   92.8   ----
it Italian       96.2   93.8   87.1
nl Dutch         69.5   93.2   65.0
no Norwegian     96.0   74.9   68.6
pl Polish        98.0   97.8   88.8
pt Portuguese    88.0   88.6   47.4
ro Romanian      92.8   96.1   82.6
sv Swedish       96.0   96.4   75.6
tr Turkish       97.6   97.4   ----
vi Vietnamese    98.7   98.9   ----
total            92.2   93.8   70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium
  – http://code.google.com/p/chromium-compact-language-detector
  – Implemented in C++, with a Python binding
  – # of supported languages: CLD = 76, langdetect = 53
  – Accuracy: CLD = 98.82%, langdetect = 99.22%
    • for 17 languages on Europarl datasets
    • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection difficult? (1)

• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult? (2)

• Tweets are too noisy
  – Expressions that violate the language's orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to be low under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)

The letter k isn't used in Italian

Motivation to Detect Short Text Language

• There are many small chunks of text besides tweets
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issue board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for mixed multi-language text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal

• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99+% accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram Logistic Regression)
• and a twitter-specific language model, or a corpus to construct it
  – a corpus of about 700K tweets with language labels

Proposed Method

How to increase features from 3-grams

• The larger n, the more features
• Maximum at n=∞, that is, all substrings
  – But the number of features is O(T²)

n of n-gram   freq≥1    freq≥2    freq≥10
1             79        72        57
2             1896      1533      902
3             15970     10369     4525
4             64966     33941     10534
5             167543    69719     15538
6             323749    107861    18970
7             524634    142954    21093
8             760719    171995    22159
9             921361    193995    22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
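A table of this kind can be produced by counting all substrings up to a given length. The sketch below is an illustration on toy input, not the script behind the numbers above; the function name and parameters are made up for the example.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """Return, for n = 1..max_n, the number of distinct substrings of
    length <= n that occur at least min_freq times in texts."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return [
        sum(1 for gram, freq in counts.items() if len(gram) <= n and freq >= min_freq)
        for n in range(1, max_n + 1)
    ]


print(cumulative_ngram_counts(["abab"], 2, min_freq=1))
print(cumulative_ngram_counts(["abab"], 2, min_freq=2))
```

Enumerating every substring this way costs time and memory quadratic in the text length, which is the O(T²) problem that maximal substrings are meant to avoid.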

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Using maximal substrings yields an equivalent model that can be constructed in linear time
  – Features are stored in a TRIE for fast prediction
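Why a trie makes prediction fast can be seen in a small sketch: starting from every position of the input, the text is walked down the trie, and every feature that ends along the way is counted. The nested-dict trie below is an assumption chosen for readability; ldig itself is described as using a Double Array, which is a more compact trie layout.

```python
def build_trie(features):
    """Build a nested-dict trie; the key None stores the id of a feature ending there."""
    root = {}
    for feature_id, feature in enumerate(features):
        node = root
        for ch in feature:
            node = node.setdefault(ch, {})
        node[None] = feature_id
    return root


def extract_features(trie, text):
    """Count occurrences of every stored feature inside text."""
    counts = {}
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break  # no stored feature continues with this character
            if None in node:
                feature_id = node[None]
                counts[feature_id] = counts.get(feature_id, 0) + 1
    return counts


trie = build_trie(["a", "abra", "ra"])
print(extract_features(trie, "abracadabra"))
```

Each walk stops as soon as no stored feature matches, so the cost depends on the length of the matching features rather than on the total number of features in the model.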

Maximal Substring (1)

• Define a containment relation (semi-order) among non-empty substrings, e.g. for "abracadabra"
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is inside an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
  – Strictly speaking, the definition also takes the position within the substring into account

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
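The "abracadabra" example can be checked with a brute-force sketch. It uses the characterization that a substring is maximal when no one-character extension to the left or right occurs equally often. This is a quadratic illustration only; the function name is made up, and real extractors work on an extended suffix array in linear time.

```python
from collections import Counter


def maximal_substrings(text):
    """Return the maximal substrings of text, shortest first."""
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # If some extension is just as frequent, sub always occurs inside it.
        absorbed = any(
            counts.get(ch + sub, 0) == freq or counts.get(sub + ch, 0) == freq
            for ch in alphabet
        )
        if not absorbed:
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```

For instance "bra" occurs twice and so does "abra", so "bra" is absorbed into "abra"; "a" occurs five times while every extension of it occurs at most twice, so "a" stays maximal.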

Maximal Substring and Infinity-Gram

• Substrings related by containment always have equal frequencies
• In a model that is a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Although the equivalence breaks down on the test set, we assume that it can be approximated with a sufficiently large training set

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the Suffix Tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin-alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What's Latin Alphabet?

• Latin alphabet ≠ ascii alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ascii
U+0080-00FF    Latin-1 Supplement            Most languages are covered with these two blocks
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (same as above)
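The block table translates directly into a range check. The sketch below is an illustration with made-up names; note that Basic Latin also contains digits and punctuation, so a real filter would need a finer test than block membership.

```python
# The 9 Unicode blocks listed in the table above, as inclusive code point ranges.
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch):
    """Return True if the character lies in one of the 9 blocks."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in LATIN_BLOCKS)


print([in_latin_blocks(ch) for ch in "a\u00e5\u1eb5\u3055"])
```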

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code chart of Latin letters; legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  – Sampling 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for only about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus creation

language         training   test
ca Catalan       9089       5082
cs Czech         9082       7682
da Danish        7388       5524
de German        44448      10065
en English       44520      10168
es Spanish       44118      10265
fi Finnish       8087       7050
fr French        44339      10098
hu Hungarian     10030      4904
id Indonesian    44722      10181
it Italian       43366      10152
nl Dutch         44682      10007
no Norwegian     10124      8496
pl Polish        16771      10152
pt Portuguese    44215      10208
ro Romanian      10021      5911
sv Swedish       44054      10032
tr Turkish       44703      10308
vi Vietnamese    15030      10488
total            538789     166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]

ndash Unsupervised normalization like coooollll rarr cool

ndash It canrsquot handle words that are not in the dictionary

bull Case 2 If the same character continues in more than 3 Shrink it to 2

ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of

bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on

bull Acronym (like WWW СССР) is not useful for language detection

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 53

Laugh Normalization

bull There are various laughs on each language

ndash HOW MUCH DO YOU LOVE COACH BEISTE

HHAHAHAHAHAH

ndash Hihihihi ) Habe ich regulaumlr 2x die Woche

ndash Tafil con eso Jajajajajajaja

ndash Malo Jejejeje XP

ndash kekeke chỗ đoacute lagravem aacuteo được ko em

bull Shrink them to double

ndash hahahha rArr haha

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 54

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 60

Usage (4) Detect Language

bull ldigpy -m [model] [test data]

ndash Detect languages of test data and output

its result and summary

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 61

Data Format

bull Training and test data

ndash [correct label]yent[meta data]yent[text]

en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

62

Usage (5) Estimation Tool

bull serverpy -m [model] -p [port number]

ndash Open httplocalhost[port] after it is executed

ndash Output their language probabilities contained features and their parameters for a text inputed in the text area

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 63

Estimation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus

As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size

64

language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992

total 166711 165053 9901 922 974

Estimation for LIGA dataset

• Evaluated on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm/

• Uses the 19-language model

All rates are in %.

language      size  detect  correct  precision  recall
de German     1479    1476     1469       99.5    99.3
en English    1505    1502     1490       99.2    99.0
es Spanish    1562    1548     1541       99.6    98.7
fr French     1551    1549     1540       99.4    99.3
it Italian    1539    1531     1528       99.8    99.3
nl Dutch      1430    1429     1424       99.7    99.6
total         9066       -     8992          -    99.2

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports ("-" = not supported by ldig). All rates are in %.

                         ldig            langdetect      CLD
language        size     correct  rate   correct  rate   correct  rate
bg Bulgarian    1000           -     -       988  98.8       991  99.1
cs Czech        1000        1000 100.0       994  99.4       995  99.5
da Danish       1000         976  97.6       968  96.8       932  93.2
de German       1000         999  99.9       998  99.8      1000 100.0
el Greek        1000           -     -      1000 100.0      1000 100.0
en English      1000         999  99.9       996  99.6      1000 100.0
es Spanish      1000        1000 100.0       996  99.6       989  98.9
et Estonian     1000           -     -       996  99.6       998  99.8
fi Finnish      1000         997  99.7       998  99.8      1000 100.0
fr French       1000         999  99.9       999  99.9       992  99.2
hu Hungarian    1000        1000 100.0       999  99.9       999  99.9
it Italian      1000         999  99.9       999  99.9       996  99.6
lt Lithuanian   1000           -     -       997  99.7       999  99.9
lv Latvian      1000           -     -       999  99.9       998  99.8
nl Dutch        1000        1000 100.0       974  97.4       995  99.5
pl Polish       1000         998  99.8       999  99.9       997  99.7
pt Portuguese   1000         995  99.5       996  99.6       989  98.9
ro Romanian     1000        1000 100.0       999  99.9       998  99.8
sk Slovak       1000           -     -       988  98.8       990  99.0
sl Slovene      1000           -     -       976  97.6       963  96.3
sv Swedish      1000         995  99.5       991  99.1       993  99.3
total          21000       13957  99.7     20850  99.3     20814  99.1

Conclusions

• A language detector using the maximal substring model

– Achieves over 99% accuracy for 19 languages

– Even langdetect reaches 97% accuracy with profiles built from the tweet corpus

• If the corpus is maintained, precision should improve further

– There are still many mistakes (in particular da and no)

• If metadata is added to the features, precision should improve further

– Open question: how to add and train metadata at low cost

• The model should be shrunk without loss of precision

– It is too large for applications (> 100 MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 16: Short Text Language Detection with Infinity-Gram

We still have ENEMY to beat


Twitter Language Detection with the Existing Methods

• Only 90-95% accuracy on the tweet corpus

• LD = language-detection

• CLD = Chromium Compact Language Detection

– http://code.google.com/p/chromium-compact-language-detector

– Regards ms (Malay) as id (Indonesian)

• Tika = Apache Tika

– http://tika.apache.org

– Evaluated on the 15 languages in our tweet corpus that Tika supports

Accuracy in % ("----" = not supported by Tika):

language          LD    CLD   Tika
ca Catalan      95.3   93.0   83.8
cs Czech        96.3   96.6   ----
da Danish       94.5   90.7   58.7
de German       86.6   96.8   73.1
en English      88.3   97.4   54.7
es Spanish      91.5   90.5   44.4
fi Finnish      98.9   99.4   94.8
fr French       95.0   94.5   67.4
hu Hungarian    85.8   89.0   76.2
id Indonesian   89.7   92.8   ----
it Italian      96.2   93.8   87.1
nl Dutch        69.5   93.2   65.0
no Norwegian    96.0   74.9   68.6
pl Polish       98.0   97.8   88.8
pt Portuguese   88.0   88.6   47.4
ro Romanian     92.8   96.1   82.6
sv Swedish      96.0   96.4   75.6
tr Turkish      97.6   97.4   ----
vi Vietnamese   98.7   98.9   ----
total           92.2   93.8   70.0

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium

– http://code.google.com/p/chromium-compact-language-detector

– Implemented in C++, with a Python binding

– Number of supported languages: CLD = 76, langdetect = 53

– Accuracy: CLD = 98.82%, langdetect = 99.22%

  • for 17 languages on Europarl datasets

  • http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is twitter Language Detection difficult? (1)

• A tweet is too short to extract enough 3-gram features

– At most 140 characters on twitter

– URLs, mentions and hashtags are not useful for detection

• LIGA [Tromp+ 11]

– Graph features based on 3-grams

  • Adds long-distance features

  • 95-98% accuracy for twitter language detection

  • 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult? (2)

• Tweets are too noisy

– Spellings that violate the language's orthography appear often

– Acronyms, abbreviations, lengthened words (like Cooooolll)

• The likelihood of a tweet tends to be low under a normal language model

Examples:

OMG: Oh My God

LOL: Laughing Out Loud

LMAO: Laughing My Ass Off

F4F: Follow for Follow

MDR: Mort de Rire (French)

TKT: Ne t'inquiète pas (French)

u: you

ur: your

4: for

i0u: I love you

k: che (Italian)

anke: anche (Italian)

The letter k isn't used in Italian.

Motivation to Detect Short Text Language

• There are many small chunks of text besides twitter

– Schedules, search queries, bulletin boards and so on

– There are many questions about short text detection on the issue tracker of the langdetect project

  • http://code.google.com/p/language-detection/issues/detail?id=10

• Detection for mixed text in multiple languages

– Cut the target document into paragraphs or lines

– Detect the language of each short text

Our Goal

• Over 99% accuracy

– However, it is too difficult to detect the language of a one-word sentence

– Our goal is 99%+ accuracy for sentences with more than 3 words

We need

• A model that can extract rich features from short text

– Maximal substring model (∞-gram Logistic Regression)

• and a twitter-specific language model, or a corpus to construct it

– A corpus of about 700K tweets with language labels

Proposal Method


How to increase features from 3-grams

• The larger n is, the more features there are

• The maximum is at n=∞, that is, all substrings

– But the number of substrings is of order O(T^2)

Number of n-grams:

n    freq≥1   freq≥2   freq≥10
1        79       72        57
2      1896     1533       902
3     15970    10369      4525
4     64966    33941     10534
5    167543    69719     15538
6    323749   107861     18970
7    524634   142954     21093
8    760719   171995     22159
9    921361   193995     22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
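A table of this kind can be reproduced on any corpus by brute force. The sketch below counts, for each n, the distinct substrings of length at most n that reach a minimum frequency. The function name and the toy input are illustrative assumptions, not the script used for the slide.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count the distinct substrings of length <= n
    that occur at least min_freq times across all texts."""
    freq = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                freq[text[i:i + n]] += 1
    counts = []
    for n in range(1, max_n + 1):
        counts.append(sum(1 for gram, f in freq.items()
                          if len(gram) <= n and f >= min_freq))
    return counts


# Toy example: "abab" has 1-grams a, b and 2-grams ab (twice), ba (once)
print(cumulative_ngram_counts(["abab"], max_n=2, min_freq=1))
print(cumulative_ngram_counts(["abab"], max_n=2, min_freq=2))
```

The nested loops enumerate every substring up to max_n, which is exactly the quadratic blow-up the slide warns about when n grows without bound.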

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass Logistic Regression using all substrings as features

– Maximal substrings give an equivalent model that can be constructed in linear time

– Features are stored in a TRIE for fast prediction

Maximal Substring (1)

• Define a containment relation (a semi-order) among the non-empty substrings of a text, e.g. abracadabra

– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is a substring of an occurrence of "bra"

– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

Strictly speaking, the definition also takes the position within the containing substring into account.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a Maximal Substring

• The maximal substrings of abracadabra are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
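The abracadabra example can be checked by brute force. The sketch below uses the common operational reading of the definition: a substring is maximal when every one-character extension, to the left or to the right, occurs strictly less often than the substring itself. This is a quadratic illustration with hypothetical function names, not the linear-time method used in practice.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, including overlapping ones."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def maximal_substrings(text):
    """Return the substrings that cannot be extended by one character
    on either side without losing occurrences."""
    alphabet = set(text)
    subs = {text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)}
    result = set()
    for s in subs:
        freq = count_occurrences(text, s)
        extensions = [c + s for c in alphabet] + [s + c for c in alphabet]
        if all(count_occurrences(text, e) < freq for e in extensions):
            result.add(s)
    return result


print(sorted(maximal_substrings("abracadabra")))
```

For instance "bra" is rejected because its left extension "abra" occurs just as often (twice), while "abra" is kept because both "dabra" and "abrac" occur only once.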

Maximal Substring and Infinity-Gram

• The frequencies of substrings that are in a containment relationship are always equal

• In a model that is a linear combination of features, features with common values can therefore be merged into one

• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams

Although the equivalence breaks down on the test set, we assume that it holds approximately given a sufficiently large training set.
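The merging step can be written out as follows. This is a sketch in notation introduced here for illustration: $C$ is one equivalence class of substrings, $m_C$ its maximal substring, $f_s(x)$ the frequency of substring $s$ in text $x$, and $w_s$ the weight of $s$ for one language.

```latex
% Within a class C all frequencies coincide on the training data:
%   f_s(x) = f_{m_C}(x)  for every s in C.
% Hence the contribution of the whole class to the linear score is
\sum_{s \in C} w_s \, f_s(x)
  \;=\; \Bigl( \sum_{s \in C} w_s \Bigr) f_{m_C}(x)
  \;=\; \tilde{w}_C \, f_{m_C}(x),
\qquad \tilde{w}_C := \sum_{s \in C} w_s .
```

So one weight $\tilde{w}_C$ per maximal substring yields the same scores as separate weights for every substring, as long as the frequencies within each class really coincide.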

Extended Suffix Array

• An Extended Suffix Array consists of

– SA = Suffix Array

– L = Longest Common Prefixes

– B = Burrows-Wheeler transformed text

• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a group of suffixes with L > 0 whose BWT contains more than 1 distinct character

– These can be calculated in linear time

• esaxx: Okanohara's implementation of ESA

– http://code.google.com/p/esaxx

via [Okanohara+ 09]
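To make the three arrays concrete, here is a deliberately naive construction for a short string. It sorts suffixes directly, so it is far from linear time and is not how esaxx works; the function name `naive_esa` and the wrap-around convention for the BWT character of the first suffix are assumptions of this sketch.

```python
def naive_esa(text):
    """Build (SA, LCP, BWT) for text by plain sorting.

    SA[i]  -- start position of the i-th smallest suffix
    LCP[i] -- length of the common prefix of suffixes SA[i-1] and SA[i]
              (LCP[0] is defined as 0)
    BWT[i] -- the character preceding suffix SA[i] (wrapping around)
    """
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])

    def common_prefix(a, b):
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    lcp = [0] + [common_prefix(sa[i - 1], sa[i]) for i in range(1, n)]
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] wraps for i == 0
    return sa, lcp, bwt


sa, lcp, bwt = naive_esa("abracadabra")
for rank, start in enumerate(sa):
    print(rank, lcp[rank], bwt[rank], "abracadabra"[start:])
```

In the printed rows, the two suffixes starting with "abra" share a prefix of length 4 and are preceded by different characters (d and a), so "abra" is maximal. The two suffixes starting with "bra" share a prefix of length 3 but are both preceded by a, so "bra" can be extended to the left and is not maximal.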

Corpus and Normalization


Target Languages

• Limit the character types to detect

– In short text detection, mixed text can be split by character type

• Latin alphabet languages

– The most difficult alphabet type to detect

– More than 25 of these languages have over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet

– å ą æ ð Ħ ŋ and so on

• They are assigned to 9 code blocks in Unicode

Range          Name                         Supplement
U+0000-007F    Basic Latin                  ASCII
U+0080-00FF    Latin-1 Supplement           Most languages are covered with these
U+0100-017F    Latin Extended-A             (same as above)
U+0180-024F    Latin Extended-B             Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF    Latin Extended Additional    Vietnamese
U+2C60-2C7F    Latin Extended-C             Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D             (same as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code chart of Latin characters. Legend: for Vietnamese only / often used / sometimes used]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API

– It samples 1% of all tweets (about 2 million tweets)

– Tweets in Latin alphabet languages account for 60% of them

• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone

– French tweets account for about 1% of all tweets

– But they account for 50% of the tweets in the Paris timezone alone

– (20% of all tweets have no timezone)

• Annotate tweets with tentative labels using langdetect

– Remove non-French tweets from those labeled 'fr'

– Recover French tweets from those not labeled 'fr'

How to annotate

[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn]

Created Corpus

• Noiseless tweets as training data

• Noisy tweets with more than 3 words as test data

• Worked with Raúl Velaz and Hiroshi Manabe on the Catalan corpus

language        training    test
ca Catalan          9089    5082
cs Czech            9082    7682
da Danish           7388    5524
de German          44448   10065
en English         44520   10168
es Spanish         44118   10265
fi Finnish          8087    7050
fr French          44339   10098
hu Hungarian       10030    4904
id Indonesian      44722   10181
it Italian         43366   10152
nl Dutch           44682   10007
no Norwegian       10124    8496
pl Polish          16771   10152
pt Portuguese      44215   10208
ro Romanian        10021    5911
sv Swedish         44054   10032
tr Turkish         44703   10308
vi Vietnamese      15030   10488
total             538789  166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the twitter corpus

– It still reaches at most 98% accuracy

• We guess it is necessary to reduce bias

– data size bias

– language-specific bias

– twitter-specific bias

Bias by Data Size

• The number of tweets per language is hugely unbalanced

• Level them out by sampling with replacement from each language, up to the size of the largest one

– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
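The approximation described above (whole copies plus a remainder sampled without replacement) might look like the following sketch. The corpus layout, a dict from language code to a list of tweets, and the function name are assumptions made for illustration.

```python
import random


def balance_by_oversampling(corpus, seed=0):
    """Grow every language to the size of the largest one.

    Each language's tweets are copied a whole number of times, and the
    remainder is drawn without replacement, so no tweet appears more
    than one extra time compared with the others of its language.
    Assumes every language has at least one tweet.
    """
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    balanced = {}
    for lang, tweets in corpus.items():
        copies, remainder = divmod(target, len(tweets))
        balanced[lang] = tweets * copies + rng.sample(tweets, remainder)
    return balanced


toy = {"en": list("abcdefg"), "da": ["x", "y", "z"]}
balanced = balance_by_oversampling(toy)
print({lang: len(tweets) for lang, tweets in balanced.items()})
```

Compared with plain sampling with replacement, this keeps every original tweet in the balanced set, which matters for the small languages.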

Convert to Lowercase on Multiple Languages

• Conversion to lower case saves corpus size and compresses the model

• But the lower case of I (U+0049) in Turkish differs from that in other languages

• Convert to lower case, excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
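A minimal sketch of "lower case everything except 'I'" is shown below. The function name is illustrative, and the sketch only special-cases U+0049 as the slide states; how ldig treats İ (U+0130) is not covered here.

```python
def lower_except_capital_i(text):
    """Lowercase every character except the ASCII capital I (U+0049),
    whose lowercase form depends on the language."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is BIG"))
```

Leaving 'I' untouched avoids deciding between i and ı before the language is known, at the cost of keeping one upper-case letter in the feature alphabet.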

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z

• There are 2 kinds of s/t with a "beard"

– U+015E-F, U+0162-3: s/t with cedilla

– U+0218-B: s/t with comma below

• 's/t with cedilla' is more popular in news, on twitter and on Wikipedia

• The 2 code points have the same design in some fonts

– Indistinguishable

ș (U+0219)   ş (U+015F)

ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography prescribes 's/t with comma', these characters were not available on PCs until recently

– 1989: Democratization of Romania

– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode

– 2007: Romania joined the EU

– 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters

– But they are more popular than the correct ones

– with cedilla : with comma = 2 : 1

– The "Romanian IME" outputs the substitutes too :D

• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ).
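The mapping "regard s/t with comma as s/t with cedilla" can be expressed as a character translation table. This is a sketch with hypothetical names; the four code-point pairs are the ones listed on the slides.

```python
# Comma-below forms mapped to the cedilla forms
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace s/t with comma below by s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021bar\u0103"))
```

Mapping toward the cedilla forms follows the slide's choice of the more frequent variant, so that both spellings of a word produce the same features.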

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem

• The character 'yeh' in Farsi corresponds to 2 code points

– Wikipedia uses only ی (U+06cc, Farsi yeh)

– News uses only ي (U+064a, Arabic yeh)

• U+064a is a substitute in Farsi

– The popular Arabic charset CP-1256 has no character that maps to U+06cc

– As 'yeh' is used very often in both languages, almost all Persian text detection fails

• Regard U+06cc as U+064a

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels

– a ă â e ê i y o ô ơ u ư

• Vietnamese has 6 tones

– a ả à ã á ạ

– These tone symbols are also used in general documents such as news

• The tone symbols can be added to all vowels

– 12 × 6 = 72

Normalization for Vietnamese (2)

• There are 2 representations of vowels with tones

– 1. Use the precomposed characters U+1ea0 - U+1ef9

  • ẵ = U+1eb5

– 2. Combine with Diacritical Marks

  • ẵ = U+0103 U+0303

– The two are used about half and half in news and tweets

• Normalize representation 2 into representation 1
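One way to fold representation 2 into representation 1 is Unicode NFC normalization from the Python standard library. This is a swapped-in technique chosen for the sketch; the slides do not say how ldig performs the conversion, and the function name is illustrative.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base vowels and combining marks into precomposed
    characters using Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"          # ă followed by a combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

NFC also composes characters outside Vietnamese, which is usually harmless for this purpose but is broader than a Vietnamese-only mapping table.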

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)

– Other character types have only 30-50 characters

• The character space is very sparse

– Characters that don't occur in the training corpus have no probabilities

  • e.g. 谢谢, Kanji for personal names

– Common frequent characters are too strong

  • e.g. a text containing "的" tends to be detected as Traditional Chinese

• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character

– (1) K-means clustering

  • Use tf-idf on Wikipedia and Google News

  • K=50 (size of the ASCII alphabet = 52)

– (2) "Commonly Used Kanji" lists published in Japan and China

  • Simplified Chinese: 现代汉语常用字表 (3500)

  • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)

  • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998

  – 常用漢字 contains few Kanji for personal names and place names

• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove

– URLs

– mentions

– hashtags

– RT

– emoticons made of letters, like XD and :p
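The removal step can be sketched with regular expressions. All patterns below are assumptions made for illustration and are not taken from ldig's source; in particular the emoticon pattern only covers a few shapes such as XD, :p and :D.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons: xD, XDDD, :p, :D, ;-P ...
EMOTICON = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[dDpP])(?!\w)")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags, RT markers and a few
    letter-based emoticons, then collapse whitespace."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_markup("RT @user check http://t.co/abc #fun now"))
```

URLs are removed first so that characters inside them are not mistaken for mentions or hashtags.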

Normalization for twitter-Specific Representation

• How should words like 'coooooooollllll' be handled?

• Case 1: Make a normalization dictionary using [Brody+ 11]

– Unsupervised normalization, like coooollll → cool

– It can't handle words that are not in the dictionary

• Case 2: If the same character repeats 3 or more times, shrink it to 2

– No language has 3 or more consecutive identical Latin letters in its orthography

  • In Japanese there are "かたたたき", "かわいいいぬ", "あわてててて" and so on

  • Acronyms (like WWW, СССР) are exceptions, but are not useful for language detection

Laugh Normalization

• Each language has its own ways of writing laughter

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi ) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha
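Both shrinking rules (repeated letters and laughter) can be sketched with two regular expressions. The patterns, the set of laugh consonants h/j/k, and the assumption that the input is already lower-cased are choices made for this illustration, not ldig's actual rules.

```python
import re

# A letter repeated 3 or more times in a row
REPEATED_LETTER = re.compile(r"([^\W\d_])\1{2,}")
# Laughter: the same consonant/vowel pair repeated, e.g. hahahha, jejeje
LAUGH = re.compile(r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*\b")


def shrink_repeats(text):
    """Shrink runs of one letter to two, then shrink laughter to two
    syllables. Expects lower-cased input."""
    text = REPEATED_LETTER.sub(r"\1\1", text)
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ["coooooooollllll", "hahahha", "malo jejejeje xp", "kaki"]:
    print(sample, "->", shrink_repeats(sample))
```

The back-references require the same consonant and the same vowel throughout, so an ordinary word such as "kaki" is left alone while irregular laughs such as "hahahha" are still caught.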

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages

– https://github.com/shuyo/ldig

  • MIT license

  • The trained model is also distributed here

– ∞-gram LR (maximal substring) [Okanohara+ 09]

– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

– Double Array

Usage (1): Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

– Extracts features from the corpus and initializes the model

– -m: model directory

– -x: path of the maximal substring extractor (executed as an external process)

– --ff: ignore features whose frequency is below the specified value

Maximal String Extractor

• maxsubst [input file] [output file]

– Input is multi-line text

  • TABs in it are replaced with " " and line feeds with U+0001

– Output is "[feature]\t[frequency]"

Usage (2): Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

– Trains the model on the corpus for 1 cycle of SGD

– -e: learning rate of SGD

– -r: regularizer of the L1 regularization

– --wr: how often to regularize all parameters

  • There are too many parameters to regularize all of them at every step

Usage (3): Shrink Model

• ldig.py -m [model] --shrink

– Removes ineffective features (those whose parameters are all 0) from the model

Usage (4): Detect Language

• ldig.py -m [model] [test data]

– Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data

– [correct label]\t[meta data]\t[text]

Examples:

en u should just enjoy ur vacation sadly

en D im online but you arent RT that much

en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
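A line in this tab-separated format could be read as in the sketch below. It assumes the label is the first field and the text is the last one, with any fields in between treated as metadata, as the Catalan example suggests; whether ldig parses lines exactly this way is not stated in the slides, and the sample values are made up.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text).

    Assumption: label is the first tab-separated field, text is the
    last one, and everything in between is metadata.
    """
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text


# Hypothetical sample lines (tab-separated)
print(parse_line("en\t\tim gettin attacked for a tweet\n"))
print(parse_line("ca\t123\t2012-05-14\tuser1\ten\tyou know xDD\n"))
```

Taking the text from the last field means the number of metadata fields can vary between lines without breaking the parser.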


Page 17: Short Text Language Detection with Infinity-Gram

Twitter Language Detection with the Existing Methods

bull Only 90-95 accuracy

for tweet corpus

bull LD = language-detection

bull CLD = Chromium Compact Language

Detection

ndash httpcodegooglecompchromium-

compact-language-detector

ndash regard ms(Malay) as id(Indonesian)

bull Tika = Apache Tika

ndash httptikaapacheorg

ndash Estimate on 15 languages which Tika

supports in our tweet corpus

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 19

language LD CLD Tikaca Catalan 953 930 838cs Czech 963 966 ----da Dannish 945 907 587de German 866 968 731en English 883 974 547es Spanish 915 905 444fi Finnish 989 994 948fr French 950 945 674hu Hungarian 858 890 762id Indonesian 897 928 ----it Italian 962 938 871nl Dutch 695 932 650no Norwegian 960 749 686pl Polish 980 978 888pt Portuguese 880 886 474ro Romanian 928 961 826sv Swedish 960 964 756tr Turkish 976 974 ----vi Vietnamese 987 989 ----

total 922 938 700

Chromium Compact Language Detection (CLD)

bull Porting the language detector from

Google Chromium ndash httpcodegooglecompchromium-compact-language-detector

ndash Implementation in C++ Python binding

ndash of supported languages CLD = 76

langdetect = 53

ndash Accuracy CLD = 9882 langdetect =

9922

bull for 17 languages on Europarl datasets bull httpblogmikemccandlesscom201110accuracy-and-performance-of-googleshtml

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 20

Is twitter Language Detection difficult (1)

bull Tweet is too short to extract 3-gram features

ndash At most 140 characters on twitter

ndash URLs mentions and hashtags are not useful to

detect

bull LIGA [Tromp+ 11]

ndash Graph-features based on 3-gram

bull Add long distance features

bull 95~98 accuracy for twitter Language Detection

bull 6 languages (de en es fr it nl)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 21

Is twitter Language Detection difficult (2)

bull Tweet is too noisy

ndash Representations against the languages orthography often appear

ndash Acronym Abbreviation lengthened word (like Cooooolll)

bull Likelihood of tweet tends to get smaller on normal language model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

OMG Oh My God

LOL Laughing Out Loud

LMAO Laughing My Ass Out

F4F Follow for Follow

MDR Mort de Rire (French)

TKT Ne tlsquoInquiegravete Pas (Fr)

u you

ur your

4 for

i0u I love you

k che (Italian)

anke anche(Italian)

Letter k isnt used in Italian

22

Motivation to Detect Short Text Language

bull There are many small chunks of text in addition to twitter

ndash Schedule search query bulletin board and so on

ndash There are many questions about short text detection in the Issues Board of langdetect Project

bull httpcodegooglecomplanguage-detectionissuesdetailid=10

bull Detection for multi-language mixed text

ndash Cut the target document in paragraphs or lines

ndash Detect for each short text

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 23

Our Goal

bull Over 99 accuracy

ndash However it is too difficult to detect one

word sentence

ndash Our Goal is 99+ accurate detection for

sentence with more than 3 words

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 24

We need

bull Rich feature extractable model from

short text

ndash Maximal substring model

(infin-gram Logistic Regression)

bull and twitter-specific Language model

or Corpus to construct it

ndash about 700K tweet corpus with language

label

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 25

Proposal Method

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 26

How to increase features from 3-grams

bull The more n the more features

bull Maximum at n=infin that is all substring

ndash But it has O(T2) order

gram of n-gram

freq≧1 freq≧2 freq≧10

1 79 72 57

2 1896 1533 902

3 15970 10369 4525

4 64966 33941 10534

5 167543 69719 15538

6 323749 107861 18970

7 524634 142954 21093

8 760719 171995 22159

9 921361 193995 22696

cumulative distributuion of feature length for 5090 normalized English tweets (300KB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 27

Text Categorization with All Substring Features [Okanohara+ 09]

bull Multiclass Logistic Regression using all

substrings as features

ndash Maximal Substring makes the equivalent

model that can be constructed in linear

time

ndash Store features into TRIE fast prediction

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 28

Maximal Substring (1)

bull Define a containment(semi-order)

among non empty substrings

abracadabra

ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur

as the substring of ldquobrardquo

ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo

but also ldquocardquo It is strictly defined with also its position in the substring

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 29

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet

– å, ą, æ, ð, Ħ, ŋ, and so on

• They are assigned to 9 code blocks in Unicode

Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered by these
U+0100-017F   Latin Extended-A              Most languages are covered by these
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   For tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D              Used by almost no present-day languages
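As a rough sketch of how the block table could be used, the following checks whether a character falls in any of the nine ranges. The names `LATIN_BLOCKS` and `in_latin_blocks` are hypothetical. Note that Basic Latin also contains digits and punctuation, so this is a range test, not a test for letters.

```python
# The nine Unicode ranges listed in the table (inclusive bounds).
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch):
    """Return True if the single character ch lies in one of the ranges above."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in LATIN_BLOCKS)


print([in_latin_blocks(c) for c in "aåŋẵさ"])
```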

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code point chart. Legend: used for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the Twitter Streaming API

– Samples 1% of all tweets (about 2 million tweets)

– Tweets in Latin-alphabet languages account for 60% of them

• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone

– French tweets account for only about 1% of all tweets

– But they account for 50% of the tweets in the Paris timezone

(20% of all tweets have no timezone)

• Assign tentative labels to tweets using langdetect

– Remove non-French tweets from those labeled ‘fr’

– Recover French tweets from those not labeled ‘fr’

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian, and Polish guides, in that order]

Created Corpus

• Noiseless tweets as training data

• Noisy tweets with more than 3 words as test data

• Joint work with Raúl Velaz and Hiroshi Manabe on the Catalan corpus

language        training  test
ca Catalan          9089   5082
cs Czech            9082   7682
da Danish           7388   5524
de German          44448  10065
en English         44520  10168
es Spanish         44118  10265
fi Finnish          8087   7050
fr French          44339  10098
hu Hungarian       10030   4904
id Indonesian      44722  10181
it Italian         43366  10152
nl Dutch           44682  10007
no Norwegian       10124   8496
pl Polish          16771  10152
pt Portuguese      44215  10208
ro Romanian        10021   5911
sv Swedish         44054  10032
tr Turkish         44703  10308
vi Vietnamese      15030  10488
total             538789 166773

Simple Language Detection

• A language detector can be built from the maximal substring model and the Twitter corpus

– But it reaches at most 98% accuracy

• We suspect that several biases need to be reduced

– data size bias

– language-specific bias

– Twitter-specific bias

Bias by Data Size

• The number of tweets differs hugely between languages

• Level them out by sampling with replacement from each language up to the size of the largest one

– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Chart: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
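The approximation described above (whole copies plus a remainder drawn without replacement) might look like the following sketch. The function names and the dict-of-lists corpus layout are assumptions for illustration, not ldig's actual code.

```python
import random


def upsample(items, target_size, rng):
    """Copy items a whole number of times, then draw the rest without replacement."""
    full_copies, remainder = divmod(target_size, len(items))
    return items * full_copies + rng.sample(items, remainder)


def balance(corpus, seed=0):
    """corpus: dict mapping a language code to a list of tweets (assumed layout)."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus.values())
    return {lang: upsample(tweets, target, rng) for lang, tweets in corpus.items()}


toy = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
balanced = balance(toy)
print({lang: len(tweets) for lang, tweets in balanced.items()})
```

Every tweet of a small language then appears either floor or ceil of (target / size) times, so no single tweet is over-represented by chance.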

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the amount of corpus needed and compresses the model

• But the lower case of I (U+0049) in Turkish differs from that in other languages

• Convert to lower case, excluding ‘I’

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
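A minimal sketch of lowercasing everything except capital I (U+0049) follows. The function name is hypothetical, and how İ (U+0130) should be treated is not specified on the slide, so this sketch simply passes it to Python's default `str.lower`.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), which is kept as-is."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))
```

Keeping ‘I’ untouched avoids deciding between i and ı before the language is known.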

Normalization for Romanian

• Romanian uses â, ă, î, ș, ț in addition to a-z

• There are 2 kinds of s/t with a “beard”

– U+015E-F, U+0162-3: s/t with cedilla

– U+0218-B: s/t with comma below

• ‘s/t with cedilla’ is more popular in news, on Twitter, and on Wikipedia

• The 2 kinds have the same design in some fonts

– Indistinguishable

ș ş   U+0219 U+015F

ț ţ   U+021B U+0163

Romanian Character Affairs on PC

• Although Romanian orthography requires ‘s/t with comma’, these characters were not available on PCs until recently

– 1989: Democratization of Romania

– 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode

– 2007: Romania joined the EU

– 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)

[Photo: ‘s/t with cedilla’ used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters

– But they are more popular than the correct ones

– with cedilla : with comma = 2 : 1

– The “Romanian IME” outputs the substitutes too :D

• Treat ‘s/t with comma’ as ‘s/t with cedilla’

ț → ţ (U+021B → U+0163)

I reckon it is similar to the relationship between the glyph variants of the Japanese character ‘SA’ (さ さ)
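The mapping from comma-below to cedilla forms can be sketched with a translation table. The code points are the ones given on the slides; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical.

```python
# Map s/t with comma below (U+0218-U+021B) to s/t with cedilla.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # S with comma below -> S with cedilla
    "\u0219": "\u015F",  # s with comma below -> s with cedilla
    "\u021A": "\u0162",  # T with comma below -> T with cedilla
    "\u021B": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    """Replace the comma-below forms with the more common cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021bar\u0103"))
```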

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem

• The character ‘yeh’ in Farsi corresponds to 2 code points

– Wikipedia uses only ی (U+06CC, Farsi yeh)

– News uses only ي (U+064A, Arabic yeh)

• U+064A is a substitute in Farsi

– The popular Arabic charset CP-1256 has no character mapped to U+06CC

– As ‘yeh’ is used very often in both languages, almost all Persian text detection fails

• Treat U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels

– a ă â e ê i y o ô ơ u ư

• Vietnamese has 6 tones

– a ả à ã á ạ

– These tone symbols are also used in general documents such as news

• The tone symbols can be attached to all vowels

– 12 × 6 = 72

Normalization for Vietnamese (2)

• Representations of vowels with tones

1. Use U+1EA0 - U+1EF9

• ẵ = U+1EB5

2. Combine with diacritical marks

• ẵ = U+0103 U+0303

– The two are used about half and half in news and tweets

• Normalize 2 into 1
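One way to fold representation 2 into representation 1 is Unicode NFC composition, sketched below with the standard `unicodedata` module. Using NFC is an assumption made for this example; the slides do not say which mechanism was actually used.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base vowel + combining tone mark into a precomposed character (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"   # a-breve followed by combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

The example from the slide, U+0103 U+0303, composes to the single code point U+1EB5.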

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)

– Other character types have only 30-50 characters

• The character space is very sparse

– Characters that don't occur in the training corpus have no probabilities

• e.g. 谢谢, Kanji for person names

– Common frequent characters are too strong

• e.g. a text that contains “的” tends to be detected as Traditional Chinese

• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character

– (1) K-means clustering

• Use tf-idf on Wikipedia and Google News

• K=50 (size of ASCII alphabet = 52)

– (2) “Commonly Used Kanji” lists provided for Japanese and Chinese

• Simplified Chinese: 现代汉语常用字表 (3500)

• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)

• Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998

– 常用漢字 contains few Kanji for person names and place names

• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove

– URLs

– mentions

– hashtags

– RT

– emoticons made of letters, such as XD and :p
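A minimal sketch of the removal step using regular expressions is shown below. The patterns and the name `strip_twitter_markup` are assumptions for illustration and are not taken from ldig; emoticons are left out because the slide does not give a full list.

```python
import re

# Hypothetical patterns for the items listed above.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hashtags and the RT marker, then tidy whitespace."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_markup("RT @user check http://t.co/abc #tag hello"))
```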

Normalization for Twitter-Specific Representation

• How to handle words like ‘coooooooollllll’?

• Case 1: Make a normalization dictionary using [Brody+ 11]

– Unsupervised normalization such as coooollll → cool

– It can't handle words that are not in the dictionary

• Case 2: If the same character repeats 3 or more times, shrink the run to 2

– No language has a run of 3 or more identical Latin letters in its orthography

• Japanese, however, has “かたたたき”, “かわいいいぬ”, “あわててて”, and so on

• Acronyms (like WWW, СССР) are not useful for language detection
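Case 2 can be sketched with a single backreference regex. Reading the rule as "runs of three or more collapse to two" is an assumption, as is the name `shrink_repeats`.

```python
import re

# A character followed by two or more copies of itself (a run of 3+).
_REPEATED = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Collapse any run of three or more identical characters to exactly two."""
    return _REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

The result is not a dictionary word, but all lengthened variants of the same word map to the same string, which is what matters for feature extraction.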

Laugh Normalization

• Each language has its own variety of laughs

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi ) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages

– https://github.com/shuyo/ldig

• MIT license

• The trained model is also distributed here

– ∞-gram LR (maximal substring) [Okanohara+ 09]

– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

– Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

– Extracts features from the corpus and initializes the model

– -m: model directory

– -x: path of the maximal substring extractor (executed as an external process)

– --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor

• maxsubst [input file] [output file]

– Input: multi-line text

• Replace TABs with “ ” and line feeds with U+0001 in it

– Output: “[feature]\t[frequency]”

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

– Trains the model on the corpus for 1 cycle of SGD

– -e: learning rate of SGD

– -r: coefficient of L1 regularization

– --wr: how often to regularize all the parameters

• There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink

– Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]

– Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data

– [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]

– Open http://localhost:[port] after launching it

– Shows the language probabilities, the features contained, and their parameters for the text entered in the text area

Evaluation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of “detect” is less than the sum of “size”

language         size   detect  correct  precision  recall  LD53  LDsm
ca Catalan       5093    4923     4857     98.66    95.37   95.3  97.0
cs Czech         7681    7668     7663     99.93    99.77   96.3  99.7
da Danish        5516    5472     5310     97.04    96.27   94.5  92.4
de German       10060   10069    10006     99.37    99.46   86.6  93.8
en English      10162   10133    10029     98.97    98.69   88.3  95.0
es Spanish      10244   10284    10120     98.41    98.79   91.5  96.0
fi Finnish       7051    7038     7024     99.80    99.62   98.9  99.6
fr French       10074   10134    10051     99.18    99.77   95.0  98.1
hu Hungarian     4904    4892     4858     99.30    99.06   85.8  95.5
id Indonesian   10178   10225    10160     99.36    99.82   89.7  98.9
it Italian      10143   10205    10103     99.00    99.61   96.2  98.0
nl Dutch        10005    9916     9858     99.42    98.53   69.5  97.4
no Norwegian     8504    8432     8201     97.26    96.44   96.0  96.3
pl Polish       10151   10149    10130     99.81    99.79   98.0  99.7
pt Portuguese   10212   10201    10119     99.20    99.09   88.0  96.9
ro Romanian      5913    5867     5850     99.71    98.93   92.8  97.4
sv Swedish      10025   10093     9942     98.50    99.17   96.0  97.9
tr Turkish      10308   10317    10298     99.82    99.90   97.6  99.5
vi Vietnamese   10487   10480    10474     99.94    99.88   98.7  99.2

total: size 166711, correct 165053 (99.01%); LD53 92.2; LDsm 97.4
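The precision and recall columns appear to follow from the counts as correct/detect and correct/size. These formulas are inferred from the numbers in the table rather than stated on the slide. A small sketch, with a hypothetical function name:

```python
def precision_recall(size, detect, correct):
    """Percentages inferred from the table: precision = correct/detect, recall = correct/size."""
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


# Catalan row: size 5093, detect 4923, correct 4857
print(precision_recall(5093, 4923, 4857))
```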

Evaluation on the LIGA Dataset

• Evaluate on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm/

(Uses the 19-language model)

language     size  detect  correct  precision  recall
de German    1479   1476     1469      99.5     99.3
en English   1505   1502     1490      99.2     99.0
es Spanish   1562   1548     1541      99.6     98.7
fr French    1551   1549     1540      99.4     99.3
it Italian   1539   1531     1528      99.8     99.3
nl Dutch     1430   1429     1424      99.7     99.6

total: size 9066, correct 8992 (99.2%)

Evaluation on the Europarl Dataset

(ldig results are given only for the languages it supports)

                        ldig            langdetect      CLD
language         size   correct  rate   correct  rate   correct  rate
bg Bulgarian     1000   -        -      988      98.8   991      99.1
cs Czech         1000   1000     100.0  994      99.4   995      99.5
da Danish        1000   976      97.6   968      96.8   932      93.2
de German        1000   999      99.9   998      99.8   1000     100.0
el Greek         1000   -        -      1000     100.0  1000     100.0
en English       1000   999      99.9   996      99.6   1000     100.0
es Spanish       1000   1000     100.0  996      99.6   989      98.9
et Estonian      1000   -        -      996      99.6   998      99.8
fi Finnish       1000   997      99.7   998      99.8   1000     100.0
fr French        1000   999      99.9   999      99.9   992      99.2
hu Hungarian     1000   1000     100.0  999      99.9   999      99.9
it Italian       1000   999      99.9   999      99.9   996      99.6
lt Lithuanian    1000   -        -      997      99.7   999      99.9
lv Latvian       1000   -        -      999      99.9   998      99.8
nl Dutch         1000   1000     100.0  974      97.4   995      99.5
pl Polish        1000   998      99.8   999      99.9   997      99.7
pt Portuguese    1000   995      99.5   996      99.6   989      98.9
ro Romanian      1000   1000     100.0  999      99.9   998      99.8
sk Slovak        1000   -        -      988      98.8   990      99.0
sl Slovene       1000   -        -      976      97.6   963      96.3
sv Swedish       1000   995      99.5   991      99.1   993      99.3
total           21000   13957    99.7   20850    99.3   20814    99.1

Conclusions

• A language detector using the maximal substring model

– Detects 19 languages with over 99% accuracy

– Even langdetect reaches 97% accuracy with the tweet corpus

• If the corpus is maintained further, the precision will improve further

– There are still many mistakes (in particular between da and no)

• If metadata is added to the features, the precision will improve further

– How can metadata be added and trained at low cost?

• We want to shrink the model without loss of precision

– It is too large for applications (>100MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Chromium Compact Language Detection (CLD)

• A port of the language detector from Google Chromium

– http://code.google.com/p/chromium-compact-language-detector

– Implemented in C++, with a Python binding

– Number of supported languages: CLD = 76, langdetect = 53

– Accuracy: CLD = 98.82%, langdetect = 99.22%

• for 17 languages on the Europarl dataset

• http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Is Twitter Language Detection Difficult? (1)

• A tweet is too short to extract enough 3-gram features

– At most 140 characters on Twitter

– URLs, mentions, and hashtags are not useful for detection

• LIGA [Tromp+ 11]

– Graph features based on 3-grams

• Adds long-distance features

• 95~98% accuracy for Twitter language detection

• 6 languages (de, en, es, fr, it, nl)

Is Twitter Language Detection Difficult? (2)

• Tweets are too noisy

– Expressions that go against the language's orthography appear often

– Acronyms, abbreviations, lengthened words (like Cooooolll)

• The likelihood of a tweet tends to be low under a normal language model

OMG    Oh My God
LOL    Laughing Out Loud
LMAO   Laughing My Ass Off
F4F    Follow for Follow
MDR    Mort de Rire (French)
TKT    Ne t'Inquiète Pas (French)
u      you
ur     your
4      for
i0u    I love you
k      che (Italian)
anke   anche (Italian)

(The letter k isn't used in Italian)

Motivation to Detect Short Text Language

• There are many small chunks of text besides Twitter

– Schedules, search queries, bulletin boards, and so on

– There are many questions about short text detection on the issue board of the langdetect project

• http://code.google.com/p/language-detection/issues/detail?id=10

• Detection for mixed multi-language text

– Split the target document into paragraphs or lines

– Detect the language of each short text

Our Goal

• Over 99% accuracy

– However, a one-word sentence is too difficult to detect

– Our goal is 99+% accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text

– Maximal substring model (∞-gram logistic regression)

• and a Twitter-specific language model, or a corpus to construct it

– A corpus of about 700K tweets with language labels

Proposal Method


How to increase features from 3-grams

• The larger n is, the more features there are

• The maximum is at n=∞, that is, all substrings

– But that has O(T^2) order

Number of n-grams up to length n:

n   freq≧1   freq≧2   freq≧10
1       79       72       57
2     1896     1533      902
3    15970    10369     4525
4    64966    33941    10534
5   167543    69719    15538
6   323749   107861    18970
7   524634   142954    21093
8   760719   171995    22159
9   921361   193995    22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
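A table of this shape can be produced with a simple counter, as in the sketch below. The function name and the choice to count character n-grams within each text separately are assumptions; the slide does not describe how its counts were obtained.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freqs=(1, 2, 10)):
    """For each n, count distinct character n-grams of length <= n at each frequency threshold."""
    counts = Counter()
    rows = []
    for n in range(1, max_n + 1):
        for text in texts:
            # Add all n-grams of exactly length n; shorter ones are already counted.
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        row = tuple(sum(1 for c in counts.values() if c >= f) for f in min_freqs)
        rows.append((n, row))
    return rows


for n, row in cumulative_ngram_counts(["abab", "abba"], max_n=3):
    print(n, row)
```

Even on real data the frequent features level off quickly as n grows, which is the pattern the freq≧10 column shows.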

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features

– Maximal substrings give an equivalent model that can be constructed in linear time

– Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (partial order) among the non-empty substrings of “abracadabra”

– “ra” ⊂ “bra” ⇔ every “ra” occurs as a substring of “bra”

– “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”. (Strictly, the relation is also defined with respect to the position within the substring.)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 29

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]

ndash Unsupervised normalization like coooollll rarr cool

ndash It canrsquot handle words that are not in the dictionary

bull Case 2 If the same character continues in more than 3 Shrink it to 2

ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of

bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on

bull Acronym (like WWW СССР) is not useful for language detection

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 53

Laugh Normalization

bull There are various laughs on each language

ndash HOW MUCH DO YOU LOVE COACH BEISTE

HHAHAHAHAHAH

ndash Hihihihi ) Habe ich regulaumlr 2x die Woche

ndash Tafil con eso Jajajajajajaja

ndash Malo Jejejeje XP

ndash kekeke chỗ đoacute lagravem aacuteo được ko em

bull Shrink them to double

ndash hahahha rArr haha

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 54

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor
• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs are replaced with “ ” and line feeds with U+0001 in it
  – Output is “[feature]\t[frequency]”
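Reading the extractor output back can be sketched as follows. This assumes only what the slide states, one `[feature]\t[frequency]` pair per line; the function name `parse_features` and the `min_freq` argument (mirroring the role of `--ff`) are hypothetical.

```python
def parse_features(lines, min_freq=1):
    """Map each feature to its frequency, dropping features rarer than min_freq."""
    features = {}
    for line in lines:
        line = line.rstrip("\n")
        # Split on the LAST tab so the feature itself is kept as-is.
        feature, sep, freq = line.rpartition("\t")
        if not sep:
            continue  # skip lines that are not in the expected format
        if int(freq) >= min_freq:
            features[feature] = int(freq)
    return features
```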

Usage (2) Learn
• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: number of times to regularize all the parameters
    • There are too many parameters to regularize all of them at every step
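The cumulative-penalty idea of [Tsuruoka+ 09] can be sketched per weight as below. This is my reading of that update rule, not code from ldig; the names `w`, `q`, `u` and `apply_cumulative_penalty` are assumptions. Here `u` is the total L1 penalty any weight could have received so far (the running sum of learning rate times regularizer), and `q[i]` is the penalty weight `i` has actually received, so a weight that was not touched for many steps catches up lazily when it is next updated.

```python
def apply_cumulative_penalty(w, q, i, u):
    """Clip weight w[i] toward zero by the penalty it has not yet received.

    w: list of weights, q: list of penalties already applied (same length),
    i: index of the weight just updated by the gradient step,
    u: cumulative penalty budget so far.
    """
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    # Record how much penalty was actually applied this time.
    q[i] += w[i] - z
```

Because the clipping stops at exactly 0, many weights end up at zero, which is what makes the later model-shrinking step possible.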

Usage (3) Shrink Model
• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language
• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format
• Training and test data
  – [correct label]\t[meta data]\t[text]
Examples:
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID] [datetime] [userID] [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
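A reader for this format can be sketched as follows. It assumes the label is the first tab-separated field and the text is the last one, with everything in between treated as metadata; `parse_line` is a hypothetical helper, and the metadata values in the usage comment are invented for illustration only.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    # First field: correct label. Last field: tweet text. Middle: metadata.
    return fields[0], fields[1:-1], fields[-1]


# Hypothetical usage:
# parse_line("ca\t123\t2012-05-14\tuser\ten\txxx xDDD")
# gives the label "ca", four metadata fields, and the text "xxx xDDD".
```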

Usage (5) Estimation Tool
• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is started
  – For a text entered in the text area, it outputs the language probabilities, the contained features and their parameters

Estimation
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on twitter corpus
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

language        size    detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan      5093    4923    4857     98.66         95.37      95.3     97.0
cs Czech        7681    7668    7663     99.93         99.77      96.3     99.7
da Danish       5516    5472    5310     97.04         96.27      94.5     92.4
de German       10060   10069   10006    99.37         99.46      86.6     93.8
en English      10162   10133   10029    98.97         98.69      88.3     95.0
es Spanish      10244   10284   10120    98.41         98.79      91.5     96.0
fi Finnish      7051    7038    7024     99.80         99.62      98.9     99.6
fr French       10074   10134   10051    99.18         99.77      95.0     98.1
hu Hungarian    4904    4892    4858     99.30         99.06      85.8     95.5
id Indonesian   10178   10225   10160    99.36         99.82      89.7     98.9
it Italian      10143   10205   10103    99.00         99.61      96.2     98.0
nl Dutch        10005   9916    9858     99.42         98.53      69.5     97.4
no Norwegian    8504    8432    8201     97.26         96.44      96.0     96.3
pl Polish       10151   10149   10130    99.81         99.79      98.0     99.7
pt Portuguese   10212   10201   10119    99.20         99.09      88.0     96.9
ro Romanian     5913    5867    5850     99.71         98.93      92.8     97.4
sv Swedish      10025   10093   9942     98.50         99.17      96.0     97.9
tr Turkish      10308   10317   10298    99.82         99.90      97.6     99.5
vi Vietnamese   10487   10480   10474    99.94         99.88      98.7     99.2
total           166711  -       165053   99.01 (correct / size)   92.2     97.4
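The precision and recall columns follow directly from the counts: precision is correct / detect and recall is correct / size. A small sketch (the function name is hypothetical) that reproduces the Catalan row:

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    size:    number of test tweets whose true label is this language
    detect:  number of tweets the detector labeled as this language
    correct: number of those detections that were right
    """
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return precision, recall


p, r = precision_recall(size=5093, detect=4923, correct=4857)
print(round(p, 2), round(r, 2))  # 98.66 95.37
```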

Estimation for LIGA dataset
• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
• The 19-language model is used

language     size   detect  correct  precision(%)  recall(%)
de German    1479   1476    1469     99.5          99.3
en English   1505   1502    1490     99.2          99.0
es Spanish   1562   1548    1541     99.6          98.7
fr French    1551   1549    1540     99.4          99.3
it Italian   1539   1531    1528     99.8          99.3
nl Dutch     1430   1429    1424     99.7          99.6
total        9066   -       8992     99.2 (correct / size)

Estimation for Europarl Dataset
(ldig figures cover only the languages supported by ldig)

                         ldig              langdetect        CLD
language         size    correct  rate(%)  correct  rate(%)  correct  rate(%)
bg Bulgarian     1000    -        -        988      98.8     991      99.1
cs Czech         1000    1000     100.0    994      99.4     995      99.5
da Danish        1000    976      97.6     968      96.8     932      93.2
de German        1000    999      99.9     998      99.8     1000     100.0
el Greek         1000    -        -        1000     100.0    1000     100.0
en English       1000    999      99.9     996      99.6     1000     100.0
es Spanish       1000    1000     100.0    996      99.6     989      98.9
et Estonian      1000    -        -        996      99.6     998      99.8
fi Finnish       1000    997      99.7     998      99.8     1000     100.0
fr French        1000    999      99.9     999      99.9     992      99.2
hu Hungarian     1000    1000     100.0    999      99.9     999      99.9
it Italian       1000    999      99.9     999      99.9     996      99.6
lt Lithuanian    1000    -        -        997      99.7     999      99.9
lv Latvian       1000    -        -        999      99.9     998      99.8
nl Dutch         1000    1000     100.0    974      97.4     995      99.5
pl Polish        1000    998      99.8     999      99.9     997      99.7
pt Portuguese    1000    995      99.5     996      99.6     989      98.9
ro Romanian      1000    1000     100.0    999      99.9     998      99.8
sk Slovak        1000    -        -        988      98.8     990      99.0
sl Slovene       1000    -        -        976      97.6     963      96.3
sv Swedish       1000    995      99.5     991      99.1     993      99.3
total            21000   13957    99.7     20850    99.3     20814    99.1

Conclusions
• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  – How to add and train metadata at low cost?
• Desire to shrink the model without loss of precision
  – Too large for applications (>100MB)

References
• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Nakatani, “twitter Language Detection Using Maximal Substrings”, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 19: Short Text Language Detection with Infinity-Gram

Is twitter Language Detection difficult? (1)
• A tweet is too short to extract enough 3-gram features
  – At most 140 characters on twitter
  – URLs, mentions and hashtags are not useful for detection
• LIGA [Tromp+ 11]
  – Graph features based on 3-grams
    • Adds long-distance features
    • 95~98% accuracy for twitter language detection
    • 6 languages (de, en, es, fr, it, nl)

Is twitter Language Detection difficult? (2)
• Tweets are too noisy
  – Representations that go against the language’s orthography often appear
  – Acronyms, abbreviations, lengthened words (like Cooooolll)
• The likelihood of a tweet tends to get smaller under a normal language model

OMG   Oh My God
LOL   Laughing Out Loud
LMAO  Laughing My Ass Off
F4F   Follow for Follow
MDR   Mort de Rire (French)
TKT   Ne t’inquiète pas (French)
u     you
ur    your
4     for
i0u   I love you
k     che (Italian)
anke  anche (Italian)
      (The letter k isn’t used in Italian)

Motivation to Detect Short Text Language
• There are many small chunks of text besides twitter
  – Schedules, search queries, bulletin boards and so on
  – There are many questions about short text detection on the issues board of the langdetect project
    • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
  – Cut the target document into paragraphs or lines
  – Detect the language of each short text

Our Goal
• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99%+ accurate detection for sentences with more than 3 words

We need
• A model that can extract rich features from short text
  – Maximal substring model (∞-gram Logistic Regression)
• and a twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels

Proposal Method


How to increase features from 3-grams
• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But the number of features is of order O(T^2)

n of n-gram   freq≧1    freq≧2    freq≧10
1             79        72        57
2             1896      1533      902
3             15970     10369     4525
4             64966     33941     10534
5             167543    69719     15538
6             323749    107861    18970
7             524634    142954    21093
8             760719    171995    22159
9             921361    193995    22696
Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
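A cumulative count like the one in this table can be sketched as follows: collect every character n-gram up to length `max_n` and count the distinct ones that reach a frequency threshold. The function name and arguments are hypothetical, and this brute-force loop is only meant for small inputs.

```python
from collections import Counter


def count_ngram_features(texts, max_n, min_freq=1):
    """Count distinct character n-grams (1 <= n <= max_n) occurring >= min_freq times."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return sum(1 for c in counts.values() if c >= min_freq)
```

For the toy input `["abab"]` with `max_n=2`, the distinct features are a, b, ab and ba, and only ba occurs fewer than twice.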

Text Categorization with All Substring Features [Okanohara+ 09]
• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Features are stored in a TRIE for fast prediction

Maximal Substring (1)
• Define a containment (semi-order) among the non-empty substrings of a text, e.g. abracadabra
  – “ra” ⊂ “bra” ⇔ every “ra” occurs as a substring of “bra”
  – “a” ⊄ “ra” ⇔ “a” occurs not only in “ra” but also in “ca”
  – Strictly speaking, it is defined together with the position within the substring

Maximal Substring (2)
• Each equivalence class formed by the containment relationship has a unique maximal element, which is called the maximal substring
• The maximal substrings of abracadabra are a, abra and abracadabra
via http://d.hatena.ne.jp/nokuno/20120203/1328237067
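The abracadabra example can be checked with a brute-force sketch. It uses an equivalent working definition that I am assuming here: a substring is maximal when extending it by one character, on the left or on the right, always lowers its number of occurrences. This is cubic-time illustration code with hypothetical function names, not the linear-time method based on the extended suffix array.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, allowing overlaps."""
    count, start = 0, text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def maximal_substrings(text):
    """Return substrings whose frequency drops under every one-character extension."""
    alphabet = set(text)
    result = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            s = text[i:j]
            if s in result:
                continue
            freq = count_occurrences(text, s)
            if all(count_occurrences(text, c + s) < freq
                   and count_occurrences(text, s + c) < freq
                   for c in alphabet):
                result.add(s)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))  # ['a', 'abra', 'abracadabra']
```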

Maximal Substring and Infinity-Gram
• Substrings that are in a containment relationship always have equal frequencies
• In a model with a linear combination of features, the common feature values can be factored out
• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams
Note: although the equivalence breaks down on the test set, we assume it can be approximated with a sufficiently large training set
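The factoring step can be written out as a short worked equation. The notation is assumed for illustration and does not come from the slides: C is one equivalence class of substrings, m is its maximal substring, f_s(x) is the frequency of substring s in text x, and w_s is the weight of s for one language.

```latex
% Within one class C, every substring has the same frequency as the maximal one:
%   f_s(x) = f_m(x) \quad \text{for all } s \in C
\sum_{s \in C} w_s \, f_s(x)
  \;=\; \Bigl( \sum_{s \in C} w_s \Bigr) f_m(x)
  \;=\; w'_m \, f_m(x)
```

So the linear score needs only one weight w'_m per class, which is why keeping the maximal substrings alone gives the same model as keeping all substrings, as long as the frequencies really are equal.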

Extended Suffix Array
• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L>0 whose BWT has more than 1 character type
  – They can be calculated in linear time
• esaxx: Okanohara’s implementation of ESA
  – http://code.google.com/p/esaxx/
via [Okanohara+ 09]

Corpus and Normalization


Target Languages
• Limit the character types to detect
  – In short text detection, mixed text can be divided by type of characters
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What’s Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered with these two blocks
U+0100-017F    Latin Extended-A              (same as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              These two blocks are used by almost no present languages
U+A720-A7FF    Latin Extended-D              (same as above)
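The block table translates directly into a range check. A minimal sketch, with the ranges copied from the table and a hypothetical function name; note that these blocks also contain digits, punctuation and combining marks, so this is a block test, not a test for letters.

```python
# (first, last) code points of the 9 blocks listed in the table.
LATIN_BLOCKS = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x0250, 0x02AF),  # IPA Extensions
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1E00, 0x1EFF),  # Latin Extended Additional
    (0x2C60, 0x2C7F),  # Latin Extended-C
    (0xA720, 0xA7FF),  # Latin Extended-D
]


def in_latin_blocks(ch: str) -> bool:
    """True if the single character ch falls in one of the 9 blocks above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)
```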

Latin Alphabets in Unicode Codepoint Chart
[Figure: Unicode code chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus
• Collect tweets with the sample method of the twitter Streaming API
  – Samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation
• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of those in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled ‘fr’
  – Recover French tweets from those not labeled ‘fr’

How to annotate
[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]

Created Corpus
• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe for the Catalan corpus creation

language        training  test
ca Catalan      9089      5082
cs Czech        9082      7682
da Danish       7388      5524
de German       44448     10065
en English      44520     10168
es Spanish      44118     10265
fi Finnish      8087      7050
fr French       44339     10098
hu Hungarian    10030     4904
id Indonesian   44722     10181
it Italian      43366     10152
nl Dutch        44682     10007
no Norwegian    10124     8496
pl Polish       16771     10152
pt Portuguese   44215     10208
ro Romanian     10021     5911
sv Swedish      44054     10032
tr Turkish      44703     10308
vi Vietnamese   15030     10488
total           538789    166773

Simple Language Detection
• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still gets at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size
• The number of tweets in each language is heavily skewed
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement
[Pie chart of tweet share by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
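The approximation described above can be sketched as follows: copy the whole data set as many times as fits, then draw the remainder without replacement. The function name and the `rng` parameter are hypothetical; `items` is assumed to be a non-empty list.

```python
import random


def level_out(items, target_size, rng=random):
    """Grow items to target_size: whole copies plus a remainder sampled without replacement."""
    copies, rest = divmod(target_size, len(items))
    return items * copies + rng.sample(items, rest)
```

With 7 items and a target of 24, every item appears 3 times and 3 of them appear a fourth time, so no item is over-represented by more than one copy.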

Convert to Lowercase on Multiple Languages
• Conversion to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding ‘I’

Language              Upper case    Lower case
Turkish, Azerbaijani  I (U+0049)    ı (U+0131)
                      İ (U+0130)    i (U+0069)
Others                I (U+0049)    i (U+0069)
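Lowercasing everything except the capital ‘I’ is a one-liner. A minimal sketch with a hypothetical function name:

```python
def lower_except_capital_i(text: str) -> str:
    """Lowercase text but leave 'I' (U+0049) untouched, since its lower case is language-dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```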

Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of characters for s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more popular in news, twitter and Wikipedia
• The 2 codes have the same design in some fonts
  – Indistinguishable
  – ș (U+0219) vs ş (U+015F)
  – ț (U+021B) vs ţ (U+0163)

Romanian Character Affairs on PC
• Although Romanian orthography stipulates that ‘s/t with comma’ must be used, they were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)
[Photo: ‘s/t with cedilla’ used on an advertisement board in Bucharest]

Normalization for Substitute Characters
• ‘s/t with cedilla’ are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
  – ț (U+021B) → ţ (U+0163)
I reckon it is similar to the relationship between the two glyph variants of the Japanese character ‘SA’: さ さ
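Mapping the comma-below letters onto their cedilla counterparts can be sketched with a translation table. The code points are the ones listed on the slides; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical.

```python
# s/t with comma below (U+0218-U+021B) -> s/t with cedilla (U+015E, U+015F, U+0162, U+0163)
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # capital S
    "\u0219": "\u015F",  # small s
    "\u021A": "\u0162",  # capital T
    "\u021B": "\u0163",  # small t
})


def normalize_romanian(text: str) -> str:
    """Regard 's/t with comma' as 's/t with cedilla'."""
    return text.translate(COMMA_TO_CEDILLA)
```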

Arabic Character Normalization (on language-detection)
• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses ی (U+06CC, Farsi yeh) only
  – News uses ي (U+064A, Arabic yeh) only
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)
• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)
• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
    • ẵ = U+1EB5
  2. Combine with Diacritical Marks
    • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1
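One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard library. This is a sketch of that approach, not necessarily how ldig does it; `normalize_vietnamese` is a hypothetical name.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)


# U+0103 (a with breve) + U+0303 (combining tilde) composes to U+1EB5.
print(normalize_vietnamese("\u0103\u0303") == "\u1eb5")  # True
```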

CJK-Kanji Normalization (1) (on language-detection)
• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has “的” tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)
• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists defined for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 doesn’t have many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter
• Simply remove
  – URL
  – mention
  – hash tag
  – RT
  – face marks using alphabet, like XD :p
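The removal list can be sketched with a handful of regular expressions. The patterns are assumptions chosen to cover the items on this slide, not the expressions used in ldig, and `strip_twitter_noise` is a hypothetical name. The face-mark pattern in particular is deliberately narrow and will also drop standalone tokens such as "XP".

```python
import re

_NOISE = [
    re.compile(r"https?://\S+"),                      # URL
    re.compile(r"@\w+:?"),                            # mention
    re.compile(r"#\w+"),                              # hash tag
    re.compile(r"\bRT\b:?"),                          # retweet marker
    re.compile(r"(?<!\S)[:;xX]-?[dDpP()]+(?!\S)"),    # face marks like XD :p
]


def strip_twitter_noise(text: str) -> str:
    """Remove twitter-specific tokens and collapse the remaining whitespace."""
    for pattern in _NOISE:
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```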


Page 20: Short Text Language Detection with Infinity-Gram

Is twitter Language Detection difficult (2)

bull Tweet is too noisy

ndash Representations against the languages orthography often appear

ndash Acronym Abbreviation lengthened word (like Cooooolll)

bull Likelihood of tweet tends to get smaller on normal language model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

OMG Oh My God

LOL Laughing Out Loud

LMAO Laughing My Ass Out

F4F Follow for Follow

MDR Mort de Rire (French)

TKT Ne tlsquoInquiegravete Pas (Fr)

u you

ur your

4 for

i0u I love you

k che (Italian)

anke anche(Italian)

Letter k isnt used in Italian

22

Motivation to Detect Short Text Language

bull There are many small chunks of text in addition to twitter

ndash Schedule search query bulletin board and so on

ndash There are many questions about short text detection in the Issues Board of langdetect Project

bull httpcodegooglecomplanguage-detectionissuesdetailid=10

bull Detection for multi-language mixed text

ndash Cut the target document in paragraphs or lines

ndash Detect for each short text

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 23

Our Goal

bull Over 99 accuracy

ndash However it is too difficult to detect one

word sentence

ndash Our Goal is 99+ accurate detection for

sentence with more than 3 words

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 24

We need

bull Rich feature extractable model from

short text

ndash Maximal substring model

(infin-gram Logistic Regression)

bull and twitter-specific Language model

or Corpus to construct it

ndash about 700K tweet corpus with language

label

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 25

Proposal Method

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 26

How to increase features from 3-grams

bull The more n the more features

bull Maximum at n=infin that is all substring

ndash But it has O(T2) order

gram of n-gram

freq≧1 freq≧2 freq≧10

1 79 72 57

2 1896 1533 902

3 15970 10369 4525

4 64966 33941 10534

5 167543 69719 15538

6 323749 107861 18970

7 524634 142954 21093

8 760719 171995 22159

9 921361 193995 22696

cumulative distributuion of feature length for 5090 normalized English tweets (300KB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 27

Text Categorization with All Substring Features [Okanohara+ 09]

bull Multiclass Logistic Regression using all

substrings as features

ndash Maximal Substring makes the equivalent

model that can be constructed in linear

time

ndash Store features into TRIE fast prediction

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 28

Maximal Substring (1)

bull Define a containment(semi-order)

among non empty substrings

abracadabra

ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur

as the substring of ldquobrardquo

ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo

but also ldquocardquo It is strictly defined with also its position in the substring

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 29

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

• Limit the character types to be detected
– In short text detection, mixed text can be split by character type
• Latin alphabet languages
– The most difficult alphabet type in which to detect languages
– More than 25 languages have over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
– å, ą, æ, ð, Ħ, ŋ and so on
• Latin characters are assigned to 9 code blocks in Unicode

Range          Name                          Supplement
U+0000-007F    Basic Latin                   ASCII
U+0080-00FF    Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F    Latin Extended-A              (as above)
U+0180-024F    Latin Extended-B              Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks   For tone symbol composition
U+1E00-1EFF    Latin Extended Additional     Vietnamese
U+2C60-2C7F    Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D              (as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code point chart of Latin letters. Legend: used for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the Twitter Streaming API
– It samples 1% of all tweets (about 2 million tweets)
– Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
– French tweets account for about 1% of all tweets
– But they account for 50% of the tweets in the Paris timezone alone
– (20% of all tweets have no timezone)
• Annotate tentative labels to the tweets using langdetect
– Remove non-French tweets from those labeled 'fr'
– Recover French tweets from those not labeled 'fr'

How to annotate

Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

code  language     training  test
ca    Catalan      9089      5082
cs    Czech        9082      7682
da    Danish       7388      5524
de    German       44448     10065
en    English      44520     10168
es    Spanish      44118     10265
fi    Finnish      8087      7050
fr    French       44339     10098
hu    Hungarian    10030     4904
id    Indonesian   44722     10181
it    Italian      43366     10152
nl    Dutch        44682     10007
no    Norwegian    10124     8496
pl    Polish       16771     10152
pt    Portuguese   44215     10208
ro    Romanian     10021     5911
sv    Swedish      44054     10032
tr    Turkish      44703     10308
vi    Vietnamese   15030     10488
total              538789    166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the Twitter corpus
– It still reaches at most 98% accuracy
• We suspect that several biases need to be reduced
– data size bias
– language-specific bias
– Twitter-specific bias

Bias by Data Size

• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language up to the size of the largest data set
– In practice this is approximated by copying the data an integer number of times and sampling the rest without replacement

[Figure: tweet volume by language. Legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
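The leveling step can be sketched as follows. This is an illustration of the approximation described on the slide (whole copies plus a remainder sampled without replacement), not ldig's actual code; `level_out` and its `seed` parameter are hypothetical names.

```python
import random


def level_out(corpus_by_lang, seed=0):
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(tweets) for tweets in corpus_by_lang.values())
    leveled = {}
    for lang, tweets in corpus_by_lang.items():
        # Whole copies first, then the remainder without replacement.
        copies, rest = divmod(target, len(tweets))
        leveled[lang] = tweets * copies + rng.sample(tweets, rest)
    return leveled


corpus = {"en": ["e%d" % i for i in range(10)], "da": ["d0", "d1", "d2"]}
leveled = level_out(corpus)
print({lang: len(tweets) for lang, tweets in leveled.items()})
```

Every language ends up with the same number of training examples, and no tweet of a small language is repeated more than one time beyond the others.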

Convert to Lowercase on Multiple Languages

• Conversion into lower case saves corpus size and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
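A minimal sketch of lowercasing that leaves 'I' (U+0049) untouched, so that the Turkish/Azerbaijani and other readings are not conflated. The function name `lower_except_i` is hypothetical, and the handling of any character other than 'I' is simply delegated to Python's `str.lower()`, which is an assumption rather than something stated on the slide.

```python
def lower_except_i(text):
    """Lowercase every character except the capital letter I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_i("ISTANBUL Is Big"))
```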

Normalization for Romanian

• Romanian uses â, ă, î, ș, ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
– U+015E-F, U+0162-3: s/t with cedilla
– U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia
• The 2 code points have the same design in some fonts
– Indistinguishable

comma below    cedilla
ș (U+0219)     ş (U+015F)
ț (U+021B)     ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
– 1989: Democratization of Romania
– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
– 2007: Romania joined the EU
– 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
– But they are more popular than the proper ones
– with cedilla : with comma = 2 : 1
– The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ さ).
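Mapping the comma-below letters onto the cedilla letters is a plain character translation. The sketch below covers the four code points listed on the slides (upper and lower case s/t); the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical.

```python
# Map 's/t with comma below' onto the more common 's/t with cedilla'.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021bar\u0103"))
```

The same normalization has to be applied to both training and test text, otherwise features learned from one variant never match the other.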

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
– Wikipedia uses only ی (U+06CC, Farsi yeh)
– News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character mapped to U+06CC
– As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
– a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
– a ả à ã á ạ
– These tone symbols are also used in general documents such as news
• The tone symbols can be attached to every vowel
– 12 × 6 = 72

Normalization for Vietnamese (2)

• Two representations of vowels with tones
– 1. Use the precomposed characters U+1EA0 - U+1EF9
  • ẵ = U+1EB5
– 2. Combine with diacritical marks
  • ẵ = U+0103 U+0303
– The two are used about half and half in news and tweets
• Normalize representation 2 into representation 1
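Folding the combining-mark representation into precomposed characters can be approximated with Unicode NFC normalization from Python's standard library. This is a stand-in chosen for illustration; the slides do not say which mechanism the original tool uses, and `normalize_vietnamese` is a hypothetical name.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining tone marks into single code points."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"   # a-breve followed by combining tilde
composed = normalize_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```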

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
– Other character types have only 30-50 characters
• The character space is very sparse
– Characters that don't occur in the training corpus have no probabilities
  • e.g. 谢谢, Kanji for person names
– Common frequent characters are too strong
  • e.g. a text which contains "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
– (1) K-means clustering
  • Use tf-idf on Wikipedia and Google News
  • K = 50 (size of ASCII alphabet = 52)
– (2) "Commonly Used Kanji" lists defined in Japan and China
  • Simplified Chinese: 现代汉语常用字表 (3500)
  • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
  • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
    – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove
– URLs
– mentions
– hashtags
– RT
– emoticons made of letters, such as XD and :p
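The removal rules can be sketched with regular expressions. The patterns below are illustrative assumptions, not the ones shipped with ldig: they cover `http(s)` URLs, `@` mentions, `#` hashtags, the token RT, and only the two emoticon shapes named on the slide.

```python
import re

# Illustrative patterns (assumptions, not ldig's own).
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
FACE_RE = re.compile(r"(?<!\w)(?:[xX]D+|:[pP])(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT and letter emoticons."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, FACE_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(strip_twitter_noise("RT @bob so true xDD http://t.co/abc #fun"))
```

URLs are removed first so that a `#` fragment or `@` inside a URL is not treated as a hashtag or mention.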

Normalization for Twitter-Specific Representation

• How to handle strings like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
– Unsupervised normalization such as coooollll → cool
– It can't handle words that are not in the dictionary
• Case 2: If the same character is repeated 3 or more times, shrink the run to 2
– No language has 3 or more consecutive identical Latin letters in its orthography
  • In Japanese there are "かたたたき", "かわいいいぬ", "あわててて" and so on
  • Acronyms (like WWW, СССР) are not useful for language detection anyway
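Case 2 fits in a single regular expression with a backreference. This is a minimal sketch under the assumption that "shrink" means replacing any run of three or more identical characters by exactly two; `shrink_repeats` is a hypothetical name.

```python
import re

# A character followed by two or more copies of itself.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink runs of 3+ identical characters to 2."""
    return REPEAT_RE.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

The result is not always a dictionary word ("cooll" rather than "cool"), but lengthened and normal spellings now share most of their substrings, which is what matters for substring features.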

Laugh Normalization

• Each language has its own ways of laughing
– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
– Hihihihi :) Habe ich regulär 2x die Woche
– Tafil con eso Jajajajajajaja
– Malo Jejejeje XP
– kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
– hahahha ⇒ haha
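One way to shrink laughs is a regular expression that matches a consonant-vowel pair repeated at least twice, tolerating doubled letters as in "hahahha". The pattern below is an assumption made for illustration (the consonant set h/j/k is taken from the examples on the slide); it is applied to lowercased text, and `shrink_laugh` is a hypothetical name.

```python
import re

# consonant(s) + vowel(s), followed by one or more repeats of the same pair.
LAUGH_RE = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def shrink_laugh(text):
    """Shrink laughs such as 'hahahha' or 'jajajaja' to two syllables."""
    return LAUGH_RE.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "tafil con eso jajajajajajaja", "kekeke"):
    print(shrink_laugh(sample))
```

Ordinary words such as "hello" are left alone because the consonant-vowel pair has to repeat with the same letters.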

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
– https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
– ∞-gram LR (maximal substring) [Okanohara+ 09]
– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
– Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
– Extracts features from the corpus and initializes the model
– -m: model directory
– -x: path of the maximal substring extractor (executed as an external process)
– --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor

• maxsubst [input file] [output file]
– The input is multi-line text
  • TABs in it are replaced with " " and line feeds with U+0001
– The output has the form "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
– Trains the model on the corpus for 1 cycle of SGD
– -e: learning rate of SGD
– -r: coefficient of the L1 regularization
– --wr: how many times to regularize all parameters
  • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
– Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
– Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
– [correct label]\t[metadata]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
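Reading one line of this tab-separated format might look like the sketch below. It assumes that the first field is the label, the last field is the text, and anything in between is metadata (the Catalan example suggests several metadata fields); `parse_line` is a hypothetical helper, not part of ldig.

```python
def parse_line(line):
    """Split '[label]\\t[metadata...]\\t[text]' into its three parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    label, metadata, text = fields[0], fields[1:-1], fields[-1]
    return label, metadata, text


print(parse_line("en\t\tim gettin attacked for a tweet\n"))
```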

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
– Open http://localhost:[port]/ after it has started
– For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the total of "detect" is less than the total of "size".

code  language     size    detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca    Catalan      5093    4923    4857     98.66         95.37      95.3     97.0
cs    Czech        7681    7668    7663     99.93         99.77      96.3     99.7
da    Danish       5516    5472    5310     97.04         96.27      94.5     92.4
de    German       10060   10069   10006    99.37         99.46      86.6     93.8
en    English      10162   10133   10029    98.97         98.69      88.3     95.0
es    Spanish      10244   10284   10120    98.41         98.79      91.5     96.0
fi    Finnish      7051    7038    7024     99.80         99.62      98.9     99.6
fr    French       10074   10134   10051    99.18         99.77      95.0     98.1
hu    Hungarian    4904    4892    4858     99.30         99.06      85.8     95.5
id    Indonesian   10178   10225   10160    99.36         99.82      89.7     98.9
it    Italian      10143   10205   10103    99.00         99.61      96.2     98.0
nl    Dutch        10005   9916    9858     99.42         98.53      69.5     97.4
no    Norwegian    8504    8432    8201     97.26         96.44      96.0     96.3
pl    Polish       10151   10149   10130    99.81         99.79      98.0     99.7
pt    Portuguese   10212   10201   10119    99.20         99.09      88.0     96.9
ro    Romanian     5913    5867    5850     99.71         98.93      92.8     97.4
sv    Swedish      10025   10093   9942     98.50         99.17      96.0     97.9
tr    Turkish      10308   10317   10298    99.82         99.90      97.6     99.5
vi    Vietnamese   10487   10480   10474    99.94         99.88      98.7     99.2
total              166711          165053   99.01 (correct/size)     92.2     97.4
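The precision and recall columns follow directly from the size, detect and correct counts, and the 0.6 cut-off explains why some texts are counted in no "detect" cell. A minimal sketch with hypothetical helper names (`precision_recall`, `pick_language`):

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent for one language."""
    precision = 100.0 * correct / detect   # correct among texts detected as it
    recall = 100.0 * correct / size        # correct among texts that are it
    return precision, recall


def pick_language(probabilities, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    lang = max(probabilities, key=probabilities.get)
    return lang if probabilities[lang] >= threshold else None


# Catalan row: size 5093, detect 4923, correct 4857
p, r = precision_recall(5093, 4923, 4857)
print(round(p, 2), round(r, 2))
print(pick_language({"da": 0.55, "no": 0.45}))
```

With the Catalan counts this reproduces the 98.66 / 95.37 figures in the table, and a text split 0.55 / 0.45 between Danish and Norwegian is left undetected.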

Estimation for LIGA dataset

• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

code  language  size   detect  correct  precision(%)  recall(%)
de    German    1479   1476    1469     99.5          99.3
en    English   1505   1502    1490     99.2          99.0
es    Spanish   1562   1548    1541     99.6          98.7
fr    French    1551   1549    1540     99.4          99.3
it    Italian   1539   1531    1528     99.8          99.3
nl    Dutch     1430   1429    1424     99.7          99.6
total           9066           8992     99.2 (correct/size)

Estimation for Europarl Dataset

The ldig total covers only the languages that ldig supports ("-" marks unsupported languages).

                             ldig              langdetect        CLD
code  language     size    correct  rate(%)  correct  rate(%)  correct  rate(%)
bg    Bulgarian    1000    -        -        988      98.8     991      99.1
cs    Czech        1000    1000     100.0    994      99.4     995      99.5
da    Danish       1000    976      97.6     968      96.8     932      93.2
de    German       1000    999      99.9     998      99.8     1000     100.0
el    Greek        1000    -        -        1000     100.0    1000     100.0
en    English      1000    999      99.9     996      99.6     1000     100.0
es    Spanish      1000    1000     100.0    996      99.6     989      98.9
et    Estonian     1000    -        -        996      99.6     998      99.8
fi    Finnish      1000    997      99.7     998      99.8     1000     100.0
fr    French       1000    999      99.9     999      99.9     992      99.2
hu    Hungarian    1000    1000     100.0    999      99.9     999      99.9
it    Italian      1000    999      99.9     999      99.9     996      99.6
lt    Lithuanian   1000    -        -        997      99.7     999      99.9
lv    Latvian      1000    -        -        999      99.9     998      99.8
nl    Dutch        1000    1000     100.0    974      97.4     995      99.5
pl    Polish       1000    998      99.8     999      99.9     997      99.7
pt    Portuguese   1000    995      99.5     996      99.6     989      98.9
ro    Romanian     1000    1000     100.0    999      99.9     998      99.8
sk    Slovak       1000    -        -        988      98.8     990      99.0
sl    Slovene      1000    -        -        976      97.6     963      96.3
sv    Swedish      1000    995      99.5     991      99.1     993      99.3
total              21000   13957    99.7     20850    99.3     20814    99.1

Conclusions

• Language detector using the maximal substring model
– It detects 19 languages with over 99% accuracy
– Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will rise further
– There are still many mistakes (in particular for da and no)
• If metadata is added to the features, the precision will rise further
– How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
– It is too large for applications (> 100MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 21: Short Text Language Detection with Infinity-Gram

Motivation to Detect Short Text Language

• There are many kinds of short text besides Twitter
– Schedules, search queries, bulletin boards and so on
– There are many questions about short text detection on the issue board of the langdetect project
  • http://code.google.com/p/language-detection/issues/detail?id=10
• Detection for multi-language mixed text
– Cut the target document into paragraphs or lines
– Detect the language of each short text

Our Goal

• Over 99% accuracy
– However, it is too difficult to detect the language of a one-word sentence
– Our goal is 99+% accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
– Maximal substring model (∞-gram Logistic Regression)
• and a Twitter-specific language model, or a corpus to construct it
– A corpus of about 700K tweets with language labels

Proposal Method


How to increase features from 3-grams

• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
– But the number of features is of order O(T^2)

n of n-gram   freq ≥ 1   freq ≥ 2   freq ≥ 10
1             79         72         57
2             1896       1533       902
3             15970      10369      4525
4             64966      33941      10534
5             167543     69719      15538
6             323749     107861     18970
7             524634     142954     21093
8             760719     171995     22159
9             921361     193995     22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
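A table of this kind can be produced by counting distinct character n-grams cumulatively over n and filtering by frequency. The sketch below is an assumption about how such counts are obtained, not the author's script; `cumulative_ngram_counts` is a hypothetical name.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """For n = 1..max_n, count distinct k-grams (k <= n) with freq >= min_freq."""
    counts = Counter()
    totals = []
    for n in range(1, max_n + 1):
        # Add all character n-grams of this length to the running counts.
        for text in texts:
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        totals.append(sum(1 for f in counts.values() if f >= min_freq))
    return totals


print(cumulative_ngram_counts(["abab"], 3))
print(cumulative_ngram_counts(["abab"], 3, min_freq=2))
```

Even on a toy string the unfiltered count keeps growing with n while the frequency-filtered count levels off, which is the pattern the table shows at scale.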

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
– Maximal substrings give an equivalent model that can be constructed in linear time
– Features are stored in a trie for fast prediction

Maximal Substring (1)

• Define a containment relation (a partial order) among the non-empty substrings of a text, for example "abracadabra"
– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

Strictly speaking, the relation is defined together with the position of the shorter string within the longer one.


Estimation for LIGA dataset

bull Estimate using LIGA[Tromp+ 11] dataset

with 9066 tweets for 6 languages

ndash httpwwwwintuenl~mpechenprojectssmm

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Use 19 language model

65

Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996

total 9066 8992 992

Estimation for Europarl Dataset

Only supported languages for ldig

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 66

ldig langdetect CLDlanguage size correct rate correct rate correct rate

bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993

total 21000 13957 997 20850 993 20814 991

Conclusions

bull Language detector using maximal substring model

ndash Detect over 99 accuracy for 19 languages

ndash langdetect with tweet corpus even has 97 accuracy

bull If the corpus is maintained the precision will be still up

ndash There are still many mistakes (in particular da and no)

bull If metadata is added to features the precision will be still up

ndash How to add and train metadata at low cost

bull Desire to shrink the model without loss of precision

ndash Too large for application (gt100MB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 67

References

bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定

bull [Okanohara+ 09] Text Categorization with All Substring Features

bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

bull [Cavnar+ 94] N-Gram-Based Text Categorization

bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 68

Page 22: Short Text Language Detection with Infinity-Gram

Our Goal

• Over 99% accuracy
  – However, it is too difficult to detect the language of a one-word sentence
  – Our goal is 99%+ accurate detection for sentences with more than 3 words

We need

• A model that can extract rich features from short text
  – Maximal substring model (∞-gram Logistic Regression)
• and a twitter-specific language model, or a corpus to construct it
  – A corpus of about 700K tweets with language labels

Proposed Method

How to increase features from 3-grams

• The larger n is, the more features there are
• The maximum is at n = ∞, that is, all substrings
  – But the number of features is of order O(T^2)

n of n-gram   freq≧1   freq≧2   freq≧10
1                 79       72        57
2               1896     1533       902
3              15970    10369      4525
4              64966    33941     10534
5             167543    69719     15538
6             323749   107861     18970
7             524634   142954     21093
8             760719   171995     22159
9             921361   193995     22696

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
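A table of this kind can be reproduced roughly as follows. This is only a sketch: it assumes that "cumulative" means the number of distinct substrings of length up to n whose corpus frequency reaches the threshold, and the function name `cumulative_ngram_counts` is made up for illustration.

```python
from collections import Counter


def cumulative_ngram_counts(texts, max_n, min_freq=1):
    """Return {n: number of distinct substrings of length <= n
    that occur at least min_freq times in the whole corpus}."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return {
        n: sum(1 for gram, c in counts.items()
               if len(gram) <= n and c >= min_freq)
        for n in range(1, max_n + 1)
    }


if __name__ == "__main__":
    corpus = ["abracadabra"]  # toy corpus, not the tweet data
    print(cumulative_ngram_counts(corpus, 3))
    print(cumulative_ngram_counts(corpus, 3, min_freq=2))
```

Because every start position is visited for every n, the work grows quickly with text length, which is the O(T^2) problem noted in the slide.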

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Using maximal substrings gives an equivalent model that can be constructed in linear time
  – Features are stored in a TRIE for fast prediction

Maximal Substring (1)

• Define a containment relation (semi-order) among non-empty substrings, e.g. in "abracadabra":
  – "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

Strictly speaking, the relation is defined together with the position within the containing substring.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
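The "abracadabra" example can be checked with a brute-force sketch. It uses an equivalent formulation (assumed here, not taken from the slides): a substring is maximal when extending it by one character on either side always lowers its number of occurrences. This is for illustration only; it is nowhere near the linear-time construction.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1)
               if text.startswith(sub, i))


def maximal_substrings(text):
    """Brute force: keep substrings whose every one-character
    extension (left or right) occurs strictly less often."""
    substrings = {text[i:j]
                  for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
    alphabet = set(text)
    result = set()
    for sub in substrings:
        freq = count_occurrences(text, sub)
        extensions = [c + sub for c in alphabet] + [sub + c for c in alphabet]
        if all(count_occurrences(text, ext) < freq for ext in extensions):
            result.add(sub)
    return result


if __name__ == "__main__":
    print(sorted(maximal_substrings("abracadabra"), key=len))
```

For example, "bra" is not maximal because "abra" occurs just as often (twice), so "bra" and "abra" fall into the same class.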

Maximal Substring and Infinity-Gram

• Substrings that are in a containment relationship always have equal frequencies
• In a model that is a linear combination of features, features that always take the same value can be merged into one
• Logistic regression with maximal substrings is equivalent to logistic regression with infinity-grams

Although the equivalence does not hold exactly on a test set, we assume that it is approximated well by a sufficiently large training set.

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT contains more than one distinct character
  – They can be computed in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                         Supplement
U+0000-007F   Basic Latin                  ASCII
U+0080-00FF   Latin-1 Supplement           Most languages are covered with these
U+0100-017F   Latin Extended-A             (as above)
U+0180-024F   Latin Extended-B             Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D             (as above)

Latin Alphabets in Unicode Codepoint Chart

(Figure omitted. Legend: for Vietnamese only / often used / sometimes used)

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'

(20% of all tweets have no timezone)

How to annotate

(Figure caption: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn)

Created Corpus

• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

language          training     test
ca Catalan            9089     5082
cs Czech              9082     7682
da Danish             7388     5524
de German            44448    10065
en English           44520    10168
es Spanish           44118    10265
fi Finnish            8087     7050
fr French            44339    10098
hu Hungarian         10030     4904
id Indonesian        44722    10181
it Italian           43366    10152
nl Dutch             44682    10007
no Norwegian         10124     8496
pl Polish            16771    10152
pt Portuguese        44215    10208
ro Romanian          10021     5911
sv Swedish           44054    10032
tr Turkish           44703    10308
vi Vietnamese        15030    10488
total               538789   166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the twitter corpus
  – It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size

• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language, up to the size of the largest language
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

(Chart of tweet counts by language omitted. Legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others)
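The levelling step can be sketched as below. It follows the approximation described in the slide (whole copies plus a remainder drawn without replacement); the function name `level_by_language` and the dict-of-lists corpus layout are assumptions made for this example, and every language is assumed to have at least one tweet.

```python
import random


def level_by_language(corpus, seed=0):
    """Grow every language to the size of the largest one:
    whole copies first, then the remainder sampled without replacement."""
    rng = random.Random(seed)
    target = max(len(items) for items in corpus.values())
    leveled = {}
    for lang, items in corpus.items():
        copies, rest = divmod(target, len(items))
        leveled[lang] = items * copies + rng.sample(items, rest)
    return leveled


if __name__ == "__main__":
    toy = {"en": ["e%d" % i for i in range(7)], "da": ["d0", "d1", "d2"]}
    for lang, items in level_by_language(toy).items():
        print(lang, len(items))
```

Compared with plain sampling with replacement, this keeps every original tweet in the result at least once.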

Convert to Lowercase on Multiple Languages

• Conversion to lower case reduces the amount of corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case excluding 'I'

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
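A minimal sketch of "lower case excluding 'I'" follows. The helper name is hypothetical, and how ldig itself treats İ (U+0130) is not stated in the slides, so that case is only flagged in a comment.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), whose lower case
    is language dependent (i in most languages, dotless ı in Turkish)."""
    # Note: Python lowercases İ (U+0130) to 'i' followed by U+0307,
    # so a real pipeline may want to handle that character explicitly.
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


if __name__ == "__main__":
    print(lower_except_capital_i("ISTANBUL Is Big"))
```

Leaving 'I' untouched avoids guessing the language before detection has happened.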

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 sets of code points for s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on twitter and on Wikipedia
• The 2 sets have the same design in some fonts
  – Indistinguishable

ș (U+0219)   ş (U+015F)
ț (U+021B)   ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography prescribes that 's/t with comma' must be used, these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: 's/t with comma' was provided by ISO8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available for everyone)

(Photo caption: 's/t with cedilla' is used on an advertisement board in Bucharest)

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph variants of the Japanese character 'SA': さ さ
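The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a small character mapping. The sketch below covers both cases of both letters; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are illustrative only.

```python
# Map s/t with comma below (U+0218-021B) to s/t with cedilla.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below forms with the more common cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)


if __name__ == "__main__":
    print(normalize_romanian("\u0219i \u021bar\u0103"))
```

The direction goes against the orthography on purpose: the substitutes are mapped to because they are the more frequent form in the data.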

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, detection fails for almost all Persian text
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents such as news
• The tone symbols can be attached to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Combining Diacritical Marks
     • ẵ = U+0103 U+0303
  – The two are used about half and half in news and tweets
• Normalize 2 into 1
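One way to normalize representation 2 into representation 1 is Unicode NFC composition. This is a swapped-in standard technique, not necessarily what ldig does internally; the function name is an assumption.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base vowels and combining tone marks into
    precomposed characters using Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    combined = "\u0103\u0303"  # ă followed by a combining tilde
    composed = compose_vietnamese(combined)
    print([hex(ord(ch)) for ch in composed])
```

NFC also touches other composable sequences in the text, so a table restricted to the 72 Vietnamese vowel and tone combinations would be a narrower alternative.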

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which contains "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URLs
  – mentions
  – hash tags
  – RT
  – emoticons made of letters, like XD and :p
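The removals can be sketched with a few regular expressions. The patterns below are rough assumptions about what counts as a URL, mention or hash tag, not the expressions used in ldig, and emoticon removal is left out for brevity.

```python
import re

# Hypothetical patterns, written for illustration only.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_twitter_markup(text):
    """Remove URLs, mentions, hash tags and 'RT', then squeeze spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(strip_twitter_markup("RT @user check this http://t.co/abc #fun"))
```

URLs are removed first so that a '#' or '@' inside a URL is not treated as a tag or a mention.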

Normalization for twitter-Specific Representation

• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character occurs 3 or more times in a row, shrink it to 2
  – No language has such runs of the same Latin letter in its orthography
    • In Japanese, though, there are "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
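Case 2 fits in a single regular expression. The sketch assumes that a run of three or more identical characters is shrunk to two, and it applies to any character rather than only Latin letters; `shrink_repeats` is a made-up name.

```python
import re

# A character followed by at least two more copies of itself.
REPEATED = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink runs of three or more identical characters to two."""
    return REPEATED.sub(r"\1\1", text)


if __name__ == "__main__":
    print(shrink_repeats("coooooooollllll"))
    print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Unlike the dictionary approach of Case 1, this does not recover "cool", but it maps every lengthened variant to one common form ("cooll") without any dictionary.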

Laugh Normalization

• Each language has its own variety of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs are replaced with " " and line feeds with U+0001
  – Output is in the form "[feature]\t[frequency]"
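The character replacement and the output format can be illustrated as below. This only mirrors what the slide describes; the helper names are hypothetical and the code is not taken from maxsubst.

```python
def flatten_for_extractor(text):
    """Replace TABs with spaces and line feeds with U+0001,
    as described for the extractor's input."""
    return text.replace("\t", " ").replace("\n", "\u0001")


def parse_feature_line(line):
    """Split one output line of the form '[feature]<TAB>[frequency]'."""
    feature, frequency = line.rstrip("\n").rsplit("\t", 1)
    return feature, int(frequency)


if __name__ == "__main__":
    print(repr(flatten_for_extractor("a\tb\nc")))
    print(parse_feature_line("abra\t2\n"))
```

With TABs replaced in the input, a feature cannot contain the field separator, so the output lines split unambiguously. Marking line boundaries with U+0001, a character unlikely to appear in tweets, keeps features that span two tweets recognizable.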

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Learn the model from the corpus with 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
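A record in this format could be parsed as sketched here. The sketch assumes that the first TAB-separated field is the label, the last is the text, and everything in between is metadata (the Catalan example suggests several metadata fields); `parse_record` is a hypothetical helper.

```python
def parse_record(line):
    """Split one line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return fields[0], fields[1:-1], fields[-1]


if __name__ == "__main__":
    record = "ca\t123\t2012-05-14\tuser\ten\txDDD no m'estranya"
    label, meta, text = parse_record(record)
    print(label, meta, text)
```

Taking the text from the last field means extra metadata columns can be added without changing the parser.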

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on twitter corpus

As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size".

language          size  detect correct precision(%) recall(%) LD53(%) LDsm(%)
ca Catalan        5093    4923    4857        98.66     95.37    95.3    97.0
cs Czech          7681    7668    7663        99.93     99.77    96.3    99.7
da Danish         5516    5472    5310        97.04     96.27    94.5    92.4
de German        10060   10069   10006        99.37     99.46    86.6    93.8
en English       10162   10133   10029        98.97     98.69    88.3    95.0
es Spanish       10244   10284   10120        98.41     98.79    91.5    96.0
fi Finnish        7051    7038    7024        99.80     99.62    98.9    99.6
fr French        10074   10134   10051        99.18     99.77    95.0    98.1
hu Hungarian      4904    4892    4858        99.30     99.06    85.8    95.5
id Indonesian    10178   10225   10160        99.36     99.82    89.7    98.9
it Italian       10143   10205   10103        99.00     99.61    96.2    98.0
nl Dutch         10005    9916    9858        99.42     98.53    69.5    97.4
no Norwegian      8504    8432    8201        97.26     96.44    96.0    96.3
pl Polish        10151   10149   10130        99.81     99.79    98.0    99.7
pt Portuguese    10212   10201   10119        99.20     99.09    88.0    96.9
ro Romanian       5913    5867    5850        99.71     98.93    92.8    97.4
sv Swedish       10025   10093    9942        98.50     99.17    96.0    97.9
tr Turkish       10308   10317   10298        99.82     99.90    97.6    99.5
vi Vietnamese    10487   10480   10474        99.94     99.88    98.7    99.2
total           166711          165053                  99.01    92.2    97.4
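The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. That reading is inferred from the numbers rather than stated in the slides. A quick check on the Catalan row:

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, assuming
    precision = correct / detect and recall = correct / size."""
    return 100.0 * correct / detect, 100.0 * correct / size


if __name__ == "__main__":
    # Catalan row: size 5093, detect 4923, correct 4857
    precision, recall = precision_recall(5093, 4923, 4857)
    print(round(precision, 2), round(recall, 2))
```

Under this reading, texts rejected as undetectable lower recall but not precision.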

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/

The 19-language model is used.

language      size  detect  correct  precision(%)  recall(%)
de German     1479    1476     1469          99.5       99.3
en English    1505    1502     1490          99.2       99.0
es Spanish    1562    1548     1541          99.6       98.7
fr French     1551    1549     1540          99.4       99.3
it Italian    1539    1531     1528          99.8       99.3
nl Dutch      1430    1429     1424          99.7       99.6
total         9066             8992                     99.2

Estimation for Europarl Dataset

                        ldig              langdetect        CLD
language        size    correct rate(%)   correct rate(%)   correct rate(%)
bg Bulgarian    1000                          988    98.8       991    99.1
cs Czech        1000       1000   100.0       994    99.4       995    99.5
da Danish       1000        976    97.6       968    96.8       932    93.2
de German       1000        999    99.9       998    99.8      1000   100.0
el Greek        1000                         1000   100.0      1000   100.0
en English      1000        999    99.9       996    99.6      1000   100.0
es Spanish      1000       1000   100.0       996    99.6       989    98.9
et Estonian     1000                          996    99.6       998    99.8
fi Finnish      1000        997    99.7       998    99.8      1000   100.0
fr French       1000        999    99.9       999    99.9       992    99.2
hu Hungarian    1000       1000   100.0       999    99.9       999    99.9
it Italian      1000        999    99.9       999    99.9       996    99.6
lt Lithuanian   1000                          997    99.7       999    99.9
lv Latvian      1000                          999    99.9       998    99.8
nl Dutch        1000       1000   100.0       974    97.4       995    99.5
pl Polish       1000        998    99.8       999    99.9       997    99.7
pt Portuguese   1000        995    99.5       996    99.6       989    98.9
ro Romanian     1000       1000   100.0       999    99.9       998    99.8
sk Slovak       1000                          988    98.8       990    99.0
sl Slovene      1000                          976    97.6       963    96.3
sv Swedish      1000        995    99.5       991    99.1       993    99.3
total          21000      13957    99.7     20850    99.3     20814    99.1

The ldig total covers only the languages that ldig supports.

Conclusions

• A language detector using the maximal substring model
  – It detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  – How can metadata be added and trained at low cost?
• We want to shrink the model without loss of precision
  – It is too large for applications (>100MB)

References

• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Language Detection for twitter Using Maximal Substrings; in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 23: Short Text Language Detection with Infinity-Gram

We need

bull Rich feature extractable model from

short text

ndash Maximal substring model

(infin-gram Logistic Regression)

bull and twitter-specific Language model

or Corpus to construct it

ndash about 700K tweet corpus with language

label

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 25

Proposal Method

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 26

How to increase features from 3-grams

bull The more n the more features

bull Maximum at n=infin that is all substring

ndash But it has O(T2) order

gram of n-gram

freq≧1 freq≧2 freq≧10

1 79 72 57

2 1896 1533 902

3 15970 10369 4525

4 64966 33941 10534

5 167543 69719 15538

6 323749 107861 18970

7 524634 142954 21093

8 760719 171995 22159

9 921361 193995 22696

cumulative distributuion of feature length for 5090 normalized English tweets (300KB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 27

Text Categorization with All Substring Features [Okanohara+ 09]

bull Multiclass Logistic Regression using all

substrings as features

ndash Maximal Substring makes the equivalent

model that can be constructed in linear

time

ndash Store features into TRIE fast prediction

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 28

Maximal Substring (1)

bull Define a containment(semi-order)

among non empty substrings

abracadabra

ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur

as the substring of ldquobrardquo

ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo

but also ldquocardquo It is strictly defined with also its position in the substring

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 29

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s and t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on Twitter and on Wikipedia
• The 2 code points have the same design in some fonts
  – Indistinguishable

| with comma below | with cedilla |
|---|---|
| ș (U+0219) | ş (U+015F) |
| ț (U+021B) | ţ (U+0163) |

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more popular than the proper ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  – ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph variants of the Japanese character 'SA' (さ).
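The mapping above can be sketched with a translation table. The names `CEDILLA_FROM_COMMA` and `normalize_romanian` are hypothetical; only the code points come from the slides.

```python
# Map 's/t with comma below' onto the more common 's/t with cedilla'.
CEDILLA_FROM_COMMA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with their cedilla counterparts."""
    return text.translate(CEDILLA_FROM_COMMA)
```

Normalizing toward the cedilla forms follows the slide's choice of the more frequent variant, so both spellings end up producing the same features.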

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
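A one-line sketch of this rule. The function name `normalize_yeh` is an assumption made for the example.

```python
def normalize_yeh(text):
    """Regard Farsi yeh (U+06CC) as Arabic yeh (U+064A)."""
    return text.replace("\u06cc", "\u064a")
```

Applying the same replacement to both training and test text makes Wikipedia-style and news-style Persian produce identical features.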

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9 (precomposed characters)
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – The two are used about half and half in news and tweets
• Normalize 2 into 1
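One way to normalize representation 2 into representation 1 is Unicode NFC composition from Python's standard library. This is a swapped-in technique for illustration; the slides do not say whether ldig uses NFC or its own mapping table, and the function name `normalize_vietnamese` is hypothetical.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)
```

With this, the sequence U+0103 U+0303 from the slide becomes the single code point U+1EB5.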

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which has "的" tends to be detected as Traditional Chinese
    • Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of the ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons written with letters, like XD, :p
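A rough sketch of this removal step using regular expressions. The patterns and the function name `strip_twitter_noise` are assumptions made for illustration; ldig's own patterns may differ, and the emoticon pattern covers only a few shapes.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
# A small, incomplete set of letter-based emoticons such as XD, XDDD, :p, :D
FACE_RE = re.compile(r"(?<!\w)(?:[xX]D+|[:;=][dDpP])(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT markers and some emoticons."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, FACE_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind
```

URLs are removed first so that the later patterns cannot match fragments inside them.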

Normalization for Twitter-Specific Representation

• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character is repeated 3 or more times, shrink the run to 2
  – No language has 3 or more consecutive occurrences of the same Latin letter in its orthography
    • Japanese does have them: "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
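Case 2 can be sketched with a backreference. The character class (ASCII letters plus U+00C0-U+024F) and the function name `shrink_repeats` are assumptions made for this example, chosen so that non-Latin text such as Japanese is left alone.

```python
import re

# A Latin letter followed by at least two more copies of itself
REPEAT_RE = re.compile(r"([A-Za-z\u00C0-\u024F])\1{2,}")


def shrink_repeats(text):
    """Shrink runs of 3 or more identical Latin letters down to 2."""
    return REPEAT_RE.sub(r"\1\1", text)
```

Note that this turns 'coooooooollllll' into 'cooll', not 'cool'; recovering the dictionary form would need the approach of Case 1.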

Laugh Normalization

• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi :) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
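A minimal sketch of laugh shrinking for regular repetitions. The syllable list and the function name `shrink_laughs` are assumptions; irregular spellings such as 'hahahha' from the slide would need extra handling that is not shown here.

```python
import re

# A laugh syllable repeated at least three times in total, e.g. "jajaja"
LAUGH_RE = re.compile(r"(ha|hi|he|ja|je|ke)\1{2,}", re.IGNORECASE)


def shrink_laughs(text):
    """Keep only the first two syllables of a repeated laugh."""
    # Each syllable is two characters, so the first four characters are kept
    return LAUGH_RE.sub(lambda m: m.group(0)[:4], text)
```

Keeping two syllables preserves the hint that 'jaja' is more typical of Spanish while 'haha' is not, without letting the length of the laugh create spurious features.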

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for the Latin alphabet
  – https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input: multi-line text
    • Replace TABs with " " and line feeds with U+0001 in it
  – Output: "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

Examples:

  en u should just enjoy ur vacation sadly
  en D im online but you arent RT that much
  en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
  ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
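Reading one line of this format can be sketched as below. It assumes that the first tab-separated field is the label, the last one is the text, and everything in between is metadata; the function name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split '[correct label]\\t[meta data]\\t[text]' into (label, metadata, text)."""
    fields = line.rstrip("\n").split("\t")
    label, *metadata, text = fields  # needs at least a label and a text field
    return label, metadata, text
```

For a line with several metadata fields, such as the Catalan example with status ID, datetime, user ID and UI language, `metadata` simply holds all of them in order.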

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is started
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text and their parameters

Estimation

LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus

Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size".

| language | size | detect | correct | precision (%) | recall (%) | LD53 (%) | LDsm (%) |
|---|---|---|---|---|---|---|---|
| ca Catalan | 5093 | 4923 | 4857 | 98.66 | 95.37 | 95.3 | 97.0 |
| cs Czech | 7681 | 7668 | 7663 | 99.93 | 99.77 | 96.3 | 99.7 |
| da Danish | 5516 | 5472 | 5310 | 97.04 | 96.27 | 94.5 | 92.4 |
| de German | 10060 | 10069 | 10006 | 99.37 | 99.46 | 86.6 | 93.8 |
| en English | 10162 | 10133 | 10029 | 98.97 | 98.69 | 88.3 | 95.0 |
| es Spanish | 10244 | 10284 | 10120 | 98.41 | 98.79 | 91.5 | 96.0 |
| fi Finnish | 7051 | 7038 | 7024 | 99.80 | 99.62 | 98.9 | 99.6 |
| fr French | 10074 | 10134 | 10051 | 99.18 | 99.77 | 95.0 | 98.1 |
| hu Hungarian | 4904 | 4892 | 4858 | 99.30 | 99.06 | 85.8 | 95.5 |
| id Indonesian | 10178 | 10225 | 10160 | 99.36 | 99.82 | 89.7 | 98.9 |
| it Italian | 10143 | 10205 | 10103 | 99.00 | 99.61 | 96.2 | 98.0 |
| nl Dutch | 10005 | 9916 | 9858 | 99.42 | 98.53 | 69.5 | 97.4 |
| no Norwegian | 8504 | 8432 | 8201 | 97.26 | 96.44 | 96.0 | 96.3 |
| pl Polish | 10151 | 10149 | 10130 | 99.81 | 99.79 | 98.0 | 99.7 |
| pt Portuguese | 10212 | 10201 | 10119 | 99.20 | 99.09 | 88.0 | 96.9 |
| ro Romanian | 5913 | 5867 | 5850 | 99.71 | 98.93 | 92.8 | 97.4 |
| sv Swedish | 10025 | 10093 | 9942 | 98.50 | 99.17 | 96.0 | 97.9 |
| tr Turkish | 10308 | 10317 | 10298 | 99.82 | 99.90 | 97.6 | 99.5 |
| vi Vietnamese | 10487 | 10480 | 10474 | 99.94 | 99.88 | 98.7 | 99.2 |
| total | 166711 | | 165053 | | 99.01 | 92.2 | 97.4 |
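The size, detect, correct, precision and recall columns can be computed as in the following sketch. The input layout (gold label, predicted label, maximum probability) and the function name `evaluate` are assumptions; only the 0.6 threshold and the column meanings come from the slide.

```python
from collections import Counter


def evaluate(results, threshold=0.6):
    """results: iterable of (gold_label, predicted_label, max_probability).

    Returns {language: (size, detect, correct, precision %, recall %)}.
    """
    size, detect, correct = Counter(), Counter(), Counter()
    for gold, predicted, prob in results:
        size[gold] += 1
        if prob < threshold:
            continue  # treated as undetectable, so it counts for no language
        detect[predicted] += 1
        if predicted == gold:
            correct[gold] += 1
    table = {}
    for lang in size:
        precision = 100.0 * correct[lang] / detect[lang] if detect[lang] else 0.0
        recall = 100.0 * correct[lang] / size[lang]
        table[lang] = (size[lang], detect[lang], correct[lang],
                       round(precision, 2), round(recall, 2))
    return table
```

As a check against the Catalan row: 4857 / 4923 gives a precision of 98.66% and 4857 / 5093 gives a recall of 95.37%.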

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm
• Uses the 19-language model

| language | size | detect | correct | precision (%) | recall (%) |
|---|---|---|---|---|---|
| de German | 1479 | 1476 | 1469 | 99.5 | 99.3 |
| en English | 1505 | 1502 | 1490 | 99.2 | 99.0 |
| es Spanish | 1562 | 1548 | 1541 | 99.6 | 98.7 |
| fr French | 1551 | 1549 | 1540 | 99.4 | 99.3 |
| it Italian | 1539 | 1531 | 1528 | 99.8 | 99.3 |
| nl Dutch | 1430 | 1429 | 1424 | 99.7 | 99.6 |
| total | 9066 | | 8992 | | 99.2 |

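Estimation for Europarl Dataset

The ldig columns cover only the languages that ldig supports.

| language | size | ldig correct | ldig rate (%) | langdetect correct | langdetect rate (%) | CLD correct | CLD rate (%) |
|---|---|---|---|---|---|---|---|
| bg Bulgarian | 1000 | | | 988 | 98.8 | 991 | 99.1 |
| cs Czech | 1000 | 1000 | 100.0 | 994 | 99.4 | 995 | 99.5 |
| da Danish | 1000 | 976 | 97.6 | 968 | 96.8 | 932 | 93.2 |
| de German | 1000 | 999 | 99.9 | 998 | 99.8 | 1000 | 100.0 |
| el Greek | 1000 | | | 1000 | 100.0 | 1000 | 100.0 |
| en English | 1000 | 999 | 99.9 | 996 | 99.6 | 1000 | 100.0 |
| es Spanish | 1000 | 1000 | 100.0 | 996 | 99.6 | 989 | 98.9 |
| et Estonian | 1000 | | | 996 | 99.6 | 998 | 99.8 |
| fi Finnish | 1000 | 997 | 99.7 | 998 | 99.8 | 1000 | 100.0 |
| fr French | 1000 | 999 | 99.9 | 999 | 99.9 | 992 | 99.2 |
| hu Hungarian | 1000 | 1000 | 100.0 | 999 | 99.9 | 999 | 99.9 |
| it Italian | 1000 | 999 | 99.9 | 999 | 99.9 | 996 | 99.6 |
| lt Lithuanian | 1000 | | | 997 | 99.7 | 999 | 99.9 |
| lv Latvian | 1000 | | | 999 | 99.9 | 998 | 99.8 |
| nl Dutch | 1000 | 1000 | 100.0 | 974 | 97.4 | 995 | 99.5 |
| pl Polish | 1000 | 998 | 99.8 | 999 | 99.9 | 997 | 99.7 |
| pt Portuguese | 1000 | 995 | 99.5 | 996 | 99.6 | 989 | 98.9 |
| ro Romanian | 1000 | 1000 | 100.0 | 999 | 99.9 | 998 | 99.8 |
| sk Slovak | 1000 | | | 988 | 98.8 | 990 | 99.0 |
| sl Slovene | 1000 | | | 976 | 97.6 | 963 | 96.3 |
| sv Swedish | 1000 | 995 | 99.5 | 991 | 99.1 | 993 | 99.3 |
| total | 21000 | 13957 | 99.7 | 20850 | 99.3 | 20814 | 99.1 |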

Conclusions

• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  – How to add and train metadata at low cost?
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Proposal Method


How to increase features from 3-grams

• The larger n is, the more features there are
• The maximum is at n=∞, that is, all substrings
  – But the number of features is of order O(T^2)

| n of n-gram | freq ≥ 1 | freq ≥ 2 | freq ≥ 10 |
|---|---|---|---|
| 1 | 79 | 72 | 57 |
| 2 | 1896 | 1533 | 902 |
| 3 | 15970 | 10369 | 4525 |
| 4 | 64966 | 33941 | 10534 |
| 5 | 167543 | 69719 | 15538 |
| 6 | 323749 | 107861 | 18970 |
| 7 | 524634 | 142954 | 21093 |
| 8 | 760719 | 171995 | 22159 |
| 9 | 921361 | 193995 | 22696 |

Cumulative distribution of feature length for 5090 normalized English tweets (300KB)
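A table of this kind can be produced with a brute-force count like the sketch below. The function name `cumulative_feature_counts` and its arguments are assumptions, and it counts n-grams within each tweet separately, which may differ from how the numbers on the slide were obtained.

```python
from collections import Counter


def cumulative_feature_counts(texts, max_n, min_freqs=(1, 2, 10)):
    """Count distinct character n-grams of length <= n for n = 1..max_n.

    Returns a list of (n, [count with freq >= f for f in min_freqs]).
    """
    counts = Counter()
    rows = []
    for n in range(1, max_n + 1):
        for text in texts:
            counts.update(text[i:i + n] for i in range(len(text) - n + 1))
        # `counts` now holds every n-gram of length 1..n, so the row is cumulative
        rows.append((n, [sum(1 for c in counts.values() if c >= f) for f in min_freqs]))
    return rows
```

The rapid growth of the freq ≥ 1 column with n is what motivates replacing explicit n-grams with maximal substrings.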

Text Categorization with All Substring Features [Okanohara+ 09]

• Multiclass logistic regression using all substrings as features
  – Maximal substrings give an equivalent model that can be constructed in linear time
  – Storing the features in a trie gives fast prediction

Maximal Substring (1)

• Define a containment relation (partial order) among non-empty substrings

Example: abracadabra
  – "ra" ⊂ "bra" ⇔ every "ra" occurs as a substring of "bra"
  – "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

Strictly speaking, the definition also takes the position within the containing substring into account.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called the maximal substring
• The maximal substrings of abracadabra are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
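The abracadabra example can be checked with a brute-force sketch. It uses the characterization that a substring is maximal when every one-character extension to the left or right occurs less often than the substring itself; this is quadratic or worse, not the linear-time method discussed later, and the function name `maximal_substrings` is hypothetical.

```python
from collections import Counter


def maximal_substrings(text):
    """Return the maximal substrings of `text` by exhaustive counting."""
    # Count every occurrence (including overlapping ones) of every substring
    counts = Counter(text[i:j]
                     for i in range(len(text))
                     for j in range(i + 1, len(text) + 1))
    alphabet = set(text)
    result = []
    for sub, freq in counts.items():
        # `sub` is not maximal if some extension keeps the same frequency
        has_equal_extension = any(
            counts.get(ch + sub, 0) == freq or counts.get(sub + ch, 0) == freq
            for ch in alphabet
        )
        if not has_equal_extension:
            result.append(sub)
    return sorted(result, key=lambda s: (len(s), s))
```

For example, "bra" is dropped because "abra" occurs exactly as often (twice), while "abra" is kept because "abrac" and "dabra" each occur only once.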

Maximal Substring and Infinity-Gram

• Substrings that are in a containment relation always have equal frequencies
• In a model with a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to the one with infinity-grams

Although the equivalence breaks down on the test set, we assume that it holds approximately given a sufficiently large training set.

Extended Suffix Array

• An Extended Suffix Array consists of
  – SA = Suffix Array
  – L = Longest Common Prefixes
  – B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix range with L>0 whose BWT has more than 1 character type
  – These can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
  – http://code.google.com/p/esaxx/

via [Okanohara+ 09]
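To make the three arrays concrete, here is a naive construction by sorting suffixes. It is far from the linear-time algorithm that esaxx implements and is only meant to show what SA, L and B contain; the function name `extended_suffix_array` and the empty-string marker for the first position are assumptions.

```python
def extended_suffix_array(text):
    """Build (SA, LCP, BWT) naively by sorting all suffixes of `text`."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])

    def common_prefix_length(a, b):
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        return length

    # lcp[k] is the common prefix length of the k-th and (k+1)-th sorted suffixes
    lcp = [common_prefix_length(text[sa[k - 1]:], text[sa[k]:]) for k in range(1, n)]
    # bwt[k] is the character preceding the k-th sorted suffix ("" at the text start)
    bwt = [text[i - 1] if i > 0 else "" for i in sa]
    return sa, lcp, bwt
```

In "abracadabra", the suffixes "abra" and "abracadabra" are adjacent with a common prefix of length 4, and their preceding characters differ ("d" versus the start of the text), which matches "abra" being a maximal substring.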

Corpus and Normalization


Target Languages

• Limit the character types to detect
  – In short text detection, mixed text can be divided by character type
• Latin-alphabet languages
  – The most difficult alphabet type to detect
  – There are more than 25 languages with over 5 million speakers

What's Latin Alphabet

• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

| Range | Name | Supplement |
|---|---|---|
| U+0000-007F | Basic Latin | ASCII |
| U+0080-00FF | Latin-1 Supplement | Most languages are covered with these |
| U+0100-017F | Latin Extended-A | Most languages are covered with these |
| U+0180-024F | Latin Extended-B | Romanian |
| U+0250-02AF | IPA Extensions | |
| U+0300-036F | Combining Diacritical Marks | for tone symbol composition |
| U+1E00-1EFF | Latin Extended Additional | Vietnamese |
| U+2C60-2C7F | Latin Extended-C | Used by almost no present-day languages |
| U+A720-A7FF | Latin Extended-D | Used by almost no present-day languages |

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code point chart of Latin letters, with the legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
  – (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]

Created Corpus

• Noise-free tweets as training data
• Noisy tweets with more than 3 words as test data
• Joint work with Raúl Velaz and Hiroshi Manabe on creating the Catalan corpus

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]

ndash Unsupervised normalization like coooollll rarr cool

ndash It canrsquot handle words that are not in the dictionary

bull Case 2 If the same character continues in more than 3 Shrink it to 2

ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of

bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on

bull Acronym (like WWW СССР) is not useful for language detection

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 53

Laugh Normalization

bull There are various laughs on each language

ndash HOW MUCH DO YOU LOVE COACH BEISTE

HHAHAHAHAHAH

ndash Hihihihi ) Habe ich regulaumlr 2x die Woche

ndash Tafil con eso Jajajajajajaja

ndash Malo Jejejeje XP

ndash kekeke chỗ đoacute lagravem aacuteo được ko em

bull Shrink them to double

ndash hahahha rArr haha

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 54

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 60

Usage (4) Detect Language

bull ldigpy -m [model] [test data]

ndash Detect languages of test data and output

its result and summary

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 61

Data Format

bull Training and test data

ndash [correct label]yent[meta data]yent[text]

en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

62

Usage (5) Estimation Tool

bull serverpy -m [model] -p [port number]

ndash Open httplocalhost[port] after it is executed

ndash Output their language probabilities contained features and their parameters for a text inputed in the text area

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 63

Estimation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus

As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size

64

language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992

total 166711 165053 9901 922 974

Estimation for LIGA dataset

bull Estimate using LIGA[Tromp+ 11] dataset

with 9066 tweets for 6 languages

ndash httpwwwwintuenl~mpechenprojectssmm

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Use 19 language model

65

Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996

total 9066 8992 992

Estimation for Europarl Dataset

Only supported languages for ldig

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 66

ldig langdetect CLDlanguage size correct rate correct rate correct rate

bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993

total 21000 13957 997 20850 993 20814 991

Conclusions

bull Language detector using maximal substring model

ndash Detect over 99 accuracy for 19 languages

ndash langdetect with tweet corpus even has 97 accuracy

bull If the corpus is maintained the precision will be still up

ndash There are still many mistakes (in particular da and no)

bull If metadata is added to features the precision will be still up

ndash How to add and train metadata at low cost

bull Desire to shrink the model without loss of precision

ndash Too large for application (gt100MB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 67

References

bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定

bull [Okanohara+ 09] Text Categorization with All Substring Features

bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

bull [Cavnar+ 94] N-Gram-Based Text Categorization

bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 68

Page 25: Short Text Language Detection with Infinity-Gram

How to increase features from 3-grams

bull The more n the more features

bull Maximum at n=infin that is all substring

ndash But it has O(T2) order

gram of n-gram

freq≧1 freq≧2 freq≧10

1 79 72 57

2 1896 1533 902

3 15970 10369 4525

4 64966 33941 10534

5 167543 69719 15538

6 323749 107861 18970

7 524634 142954 21093

8 760719 171995 22159

9 921361 193995 22696

cumulative distributuion of feature length for 5090 normalized English tweets (300KB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 27

Text Categorization with All Substring Features [Okanohara+ 09]

bull Multiclass Logistic Regression using all

substrings as features

ndash Maximal Substring makes the equivalent

model that can be constructed in linear

time

ndash Store features into TRIE fast prediction

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 28

Maximal Substring (1)

• Define a containment relation (semi-order) among non-empty substrings, e.g. for "abracadabra":
– "ra" ⊂ "bra" ⇔ every occurrence of "ra" is part of an occurrence of "bra"
– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"
• Strictly speaking, the relation is defined together with the position inside the containing substring

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of "abracadabra" are "a", "abra" and "abracadabra"

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
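The "abracadabra" example can be checked by brute force. The sketch below uses an equivalent working criterion, stated here as an assumption: a substring is maximal when every one-character extension to the left or right occurs strictly less often than the substring itself. It is quadratic or worse and only meant for tiny strings; the function names are illustrative.

```python
def count_occurrences(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))

def maximal_substrings(text):
    """Brute-force enumeration of maximal substrings of a short string."""
    alphabet = set(text)
    substrings = {text[i:j] for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
    result = []
    for s in substrings:
        n = count_occurrences(text, s)
        extensions = [c + s for c in alphabet] + [s + c for c in alphabet]
        # Maximal: no extension keeps the same number of occurrences.
        if all(count_occurrences(text, e) < n for e in extensions):
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))

found = maximal_substrings("abracadabra")
```

For "abracadabra" this yields the same three strings as the slide: "b", "r", "bra" and so on are absorbed into "abra" because they never occur outside it.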

Maximal Substring and Infinity-Gram

• Frequencies of substrings that are in a containment relation are always equal
• In a model with a linear combination of features, features that always take the same value can be bundled into one
• Logistic regression with maximal substrings is therefore equivalent to the one with infinity-grams

Although the equivalence breaks down on the test set, we assume that it can be approximated with a sufficiently large training set.

Extended Suffix Array

• An Extended Suffix Array consists of
– SA = Suffix Array
– L = Longest Common Prefixes
– B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
– They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
– http://code.google.com/p/esaxx/

via [Okanohara+ 09]
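To make the three arrays concrete, here is a deliberately naive construction in Python. It is not the linear-time algorithm that esaxx implements; the function name and the "$" terminator convention are assumptions made for this sketch.

```python
def extended_suffix_array(text):
    """Naive construction of SA, LCP and BWT for a short, terminated string."""
    n = len(text)
    # SA: start positions of all suffixes in lexicographic order.
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(a, b):
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    # L: longest common prefix of each suffix with its predecessor in SA.
    lcps = [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, n)]
    # B: character preceding each suffix (cyclically), i.e. the BWT.
    bwt = "".join(text[i - 1] for i in sa)
    return sa, lcps, bwt

sa, lcps, bwt = extended_suffix_array("banana$")
```

Entries with a positive LCP mark prefixes shared by neighbouring suffixes, and differing BWT characters over such a group indicate that the shared prefix cannot be extended to the left, which is the condition the slide describes.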

Corpus and Normalization


Target Languages

• Limit the character type to detect
– In short text detection, mixed text can be divided by character type
• Latin alphabet languages
– The most difficult alphabet type to detect
– There are more than 25 languages with over 5 million speakers

What's Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
– å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered with these
U+0100-017F   Latin Extended-A              Most languages are covered with these
U+0180-024F   Latin Extended-B              Rumanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for tone symbol composition
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Used by almost no present-day languages
U+A720-A7FF   Latin Extended-D              Used by almost no present-day languages
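The block table translates directly into a lookup. The sketch below is an illustration only; the helper name `latin_block` is hypothetical and the ranges are copied from the table above.

```python
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]

def latin_block(ch):
    """Return the name of the Latin-related block containing ch, or None."""
    cp = ord(ch)
    for lo, hi, name in LATIN_BLOCKS:
        if lo <= cp <= hi:
            return name
    return None
```

A character for which `latin_block` returns None (for example a Kana or Cyrillic letter) can be used to split mixed text by character type before detection.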

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode codepoint chart. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
– It samples 1% of all tweets (about 2 million tweets)
– Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
– French tweets account for about 1% of all tweets
– But they account for 50% of the tweets in the Paris timezone alone
– (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
– Remove non-French tweets from those labeled 'fr'
– Recover French tweets from those not labeled 'fr'

How to annotate

[Figure: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Rumanian and Polish guides in turn]

Created Corpus

• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• Work with Raúl Velaz and Hiroshi Manabe for Catalan corpus creation

code  language     training    test
ca    Catalan          9089    5082
cs    Czech            9082    7682
da    Danish           7388    5524
de    German          44448   10065
en    English         44520   10168
es    Spanish         44118   10265
fi    Finnish          8087    7050
fr    French          44339   10098
hu    Hungarian       10030    4904
id    Indonesian      44722   10181
it    Italian         43366   10152
nl    Dutch           44682   10007
no    Norwegian       10124    8496
pl    Polish          16771   10152
pt    Portuguese      44215   10208
ro    Romanian        10021    5911
sv    Swedish         44054   10032
tr    Turkish         44703   10308
vi    Vietnamese      15030   10488
total                538789  166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and the twitter corpus
– It still reaches at most 98% accuracy
• We guess it is necessary to reduce bias
– data size bias
– language-specific bias
– twitter-specific bias

Bias by Data Size

• The number of tweets per language is hugely biased
• Level them out by sampling with replacement from each language up to the size of the largest data
– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Chart labels: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
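The approximation in the last bullet can be sketched as follows. The function name, the dictionary layout and the fixed seed are assumptions for illustration, not the code actually used for the corpus.

```python
import random

def level_out(samples_by_lang, seed=0):
    """Oversample every language up to the size of the largest one."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in samples_by_lang.values())
    leveled = {}
    for lang, samples in samples_by_lang.items():
        # Copy the data an integer number of times ...
        copies, rest = divmod(target, len(samples))
        # ... and draw the remainder without replacement.
        leveled[lang] = samples * copies + rng.sample(samples, rest)
    return leveled

leveled = level_out({"en": list("abcdefg"), "da": list("xyz")})
```

Compared with plain sampling with replacement, this keeps every original tweet in the leveled data at least `copies` times, so no training example is lost by chance.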

Convert to Lowercase on Multiple Languages

• Conversion into lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
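A minimal sketch of "lower case excluding 'I'" in Python is shown below. The function name is hypothetical, and the handling of other characters simply follows Python's built-in `str.lower`, which is an assumption rather than the behaviour of ldig itself.

```python
def lower_except_capital_i(text):
    """Lowercase every character except the ASCII capital I (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)

example = lower_except_capital_i("ISTANBUL")
```

Leaving 'I' untouched avoids deciding between i and ı before the language is known; the model then sees 'I' as its own feature character.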

Normalization for Rumanian

• Rumanian uses â ă î ș ț in addition to a-z
• There are 2 character types for s/t with a "beard"
– U+015E-F, U+0162-3: s/t with cedilla
– U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, twitter and Wikipedia
• The 2 codes have the same design in some fonts
– Indistinguishable

ș ş   U+0219 U+015F
ț ţ   U+021B U+0163

Rumanian Character Affairs on PC

• Although Romanian orthography prescribes that 's/t with comma' must be used, they were not available on PCs until recently
– 1989: Democratization in Rumania
– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
– 2007: Rumania joined the EU
– 2007: Windows Vista supported 's/t with comma' (available for everyone)

[Photo: 's/t with cedilla' is used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
– But they are more popular than the proper ones
– with cedilla : with comma = 2 : 1
– "Rumanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț → ţ   U+021B → U+0163

I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA' (さ).
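The mapping "regard comma as cedilla" amounts to a four-entry translation table. The sketch below is one possible way to write it in Python; the names are illustrative and the code points are the ones listed on the slides.

```python
# Map s/t with comma below (U+0218-021B) to s/t with cedilla.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})

def normalize_rumanian(text):
    """Replace comma-below letters with their cedilla substitutes."""
    return text.translate(COMMA_TO_CEDILLA)

normalized = normalize_rumanian("\u0219i \u021bar\u0103")
```

Normalizing towards the more frequent form means both spellings of a word end up as the same feature, so the model does not split its evidence between them.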

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
– Wikipedia uses ی (U+06CC, Farsi yeh) only
– News uses ي (U+064A, Arabic yeh) only
• U+064A is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character mapped to U+06CC
– As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
– a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
– a ả à ã á ạ
– These tone symbols are also used in general documents like news
• The tone symbols can be appended to all vowels
– 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
1. Use U+1EA0 - U+1EF9
• ẵ = U+1EB5
2. Combine with Diacritical Marks
• ẵ = U+0103 U+0303
– Half and half on news and tweets
• Normalize 2 into 1
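For the example on this slide, Unicode NFC composition already performs the "2 into 1" step. The sketch below uses Python's standard `unicodedata` module; whether the original implementation relies on NFC or on its own mapping table is not stated, so treat this as an assumption.

```python
import unicodedata

def compose_vietnamese(text):
    """Compose base letters and combining tone marks into precomposed letters."""
    return unicodedata.normalize("NFC", text)

composed = compose_vietnamese("\u0103\u0303")  # ă + combining tilde
```

After composition both representations of ẵ become the single code point U+1EB5, so they no longer produce different substring features.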

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
– Other character types have only 30-50 characters
• The character space is very sparse
– Characters that don't occur in the training corpus have no probabilities
• e.g. 谢谢, Kanji for person names
– Common frequent characters are too strong
• e.g. a text which has "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
– (1) K-means clustering
• Use tf-idf on Wikipedia and Google News
• K = 50 (size of ASCII alphabet = 52)
– (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
• Simplified Chinese: 现代汉语常用字表 (3500)
• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
• Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
– 常用漢字 has few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
– URLs
– mentions
– hash tags
– RT
– face marks made of letters, like XD and :p
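The removals listed above can be sketched with a handful of regular expressions. All patterns below are illustrative assumptions, not the ones used in ldig, and they cover only the simplest forms of each item.

```python
import re

# Hypothetical patterns for the five kinds of noise named on the slide.
PATTERNS = [
    re.compile(r"https?://\S+"),                           # URL
    re.compile(r"@\w+:?"),                                 # mention
    re.compile(r"#\w+"),                                   # hash tag
    re.compile(r"\bRT\b"),                                 # retweet marker
    re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD])(?!\w)"),   # XD, :p, ...
]

def strip_twitter_noise(text):
    """Remove twitter-specific tokens and collapse the remaining whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())

cleaned = strip_twitter_noise("RT @alice: so true http://t.co/x #fun XD")
```

The URL pattern runs first so that the colon inside "http://" is gone before the emoticon pattern is applied.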

Normalization for twitter-Specific Representation

• How to handle expressions like 'coooooooollllll'
• Case 1: Make a normalization dictionary using [Brody+ 2011]
– Unsupervised normalization like coooollll → cool
– It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
– No language has 3 or more consecutive identical Latin letters in its orthography
• In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
• Acronyms (like WWW, СССР) are not useful for language detection
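Case 2 fits in a single regular expression with a backreference. The sketch below is an assumption about how such a rule could be written; it applies to any character, not only Latin letters.

```python
import re

# A character followed by at least two more copies of itself.
REPEATED_CHAR = re.compile(r"(.)\1{2,}")

def shrink_repeats(text):
    """Shrink runs of 3 or more identical characters to exactly 2."""
    return REPEATED_CHAR.sub(r"\1\1", text)

shrunk = shrink_repeats("coooooooollllll")
```

Note that the result is "cooll", not "cool": the rule does not try to recover the dictionary form, it only bounds the run length so that lengthened and regular spellings share most of their substrings.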

Laugh Normalization

• There are various laughs in each language
– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
– Hihihihi :) Habe ich regulär 2x die Woche
– Tafil con eso Jajajajajajaja
– Malo Jejejeje XP
– kekeke chỗ đó làm áo được ko em
• Shrink them to double
– hahahha ⇒ haha
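One simple way to "shrink to double" is to look for a two-letter syllable repeated three or more times. This is only a sketch under that assumption: it expects lower-cased input and does not repair irregular laughs such as "hahahha".

```python
import re

# A two-letter unit followed by at least two more copies of itself.
LAUGH = re.compile(r"([a-z]{2})\1{2,}")

def shrink_laugh(text):
    """Shrink three or more repetitions of a two-letter unit to two."""
    return LAUGH.sub(r"\1\1", text)

laugh = shrink_laugh("jajajajajajaja")
```

The same rule covers "hihihihi", "jejejeje" and "kekeke" without listing the syllables of each language explicitly.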

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
– https://github.com/shuyo/ldig
• MIT license
• The trained model is also distributed there
– ∞-gram LR (maximal substring) [Okanohara+ 09]
– L1 SGD (cumulative penalty) [Tsuruoka+ 09]
– Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
– Extract features from the corpus and initialize the model
– -m: model directory
– -x: path of the maximal substring extractor (executed as an external process)
– --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
– Input is multi-line text
• TABs are replaced with " " and line feeds with U+0001 in it
– Output as "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
– Learn the model using the corpus with 1 cycle of SGD
– -e: learning rate of SGD
– -r: coefficient of L1 regularization
– --wr: number of times to regularize all parameters
• There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
– Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
– Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
– [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
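Reading one line of this format could look like the sketch below. It assumes that the first tab-separated field is the label, the last one is the text, and everything in between is metadata; the function name and the sample values are hypothetical.

```python
def parse_line(line):
    """Split a corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]       # correct language label, e.g. "en"
    text = fields[-1]       # tweet text is assumed to be the last field
    metadata = fields[1:-1] # zero or more metadata fields in between
    return label, metadata, text

label, metadata, text = parse_line("ca\t123\t2012-05-14\tuser\tca\thola a tothom\n")
```

Treating the middle fields as an open-ended list keeps the parser usable both for lines with an empty metadata field and for lines that carry status ID, datetime, user ID and UI language.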

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
– Open http://localhost:[port]/ after it is executed
– It outputs the language probabilities, the contained features and their parameters for a text entered in the text area

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on twitter corpus

As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size.

code  language      size   detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca    Catalan       5093     4923     4857         98.66      95.37     95.3     97.0
cs    Czech         7681     7668     7663         99.93      99.77     96.3     99.7
da    Danish        5516     5472     5310         97.04      96.27     94.5     92.4
de    German       10060    10069    10006         99.37      99.46     86.6     93.8
en    English      10162    10133    10029         98.97      98.69     88.3     95.0
es    Spanish      10244    10284    10120         98.41      98.79     91.5     96.0
fi    Finnish       7051     7038     7024         99.80      99.62     98.9     99.6
fr    French       10074    10134    10051         99.18      99.77     95.0     98.1
hu    Hungarian     4904     4892     4858         99.30      99.06     85.8     95.5
id    Indonesian   10178    10225    10160         99.36      99.82     89.7     98.9
it    Italian      10143    10205    10103         99.00      99.61     96.2     98.0
nl    Dutch        10005     9916     9858         99.42      98.53     69.5     97.4
no    Norwegian     8504     8432     8201         97.26      96.44     96.0     96.3
pl    Polish       10151    10149    10130         99.81      99.79     98.0     99.7
pt    Portuguese   10212    10201    10119         99.20      99.09     88.0     96.9
ro    Romanian      5913     5867     5850         99.71      98.93     92.8     97.4
sv    Swedish      10025    10093     9942         98.50      99.17     96.0     97.9
tr    Turkish      10308    10317    10298         99.82      99.90     97.6     99.5
vi    Vietnamese   10487    10480    10474         99.94      99.88     98.7     99.2
total             166711            165053                    99.01     92.2     97.4

The total row gives the overall accuracy (correct / size).
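The precision and recall columns follow from the three count columns: precision = correct / detect and recall = correct / size. The short sketch below recomputes the Catalan row; the function name is illustrative.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent from the three counts."""
    precision = 100.0 * correct / detect  # share of detections that are right
    recall = 100.0 * correct / size       # share of true texts that are found
    return precision, recall

precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
```

Because undetectable texts lower detect but not size, a language can have high precision and noticeably lower recall, as Catalan does here.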

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

code  language   size   detect  correct  precision(%)  recall(%)
de    German     1479     1476     1469          99.5       99.3
en    English    1505     1502     1490          99.2       99.0
es    Spanish    1562     1548     1541          99.6       98.7
fr    French     1551     1549     1540          99.4       99.3
it    Italian    1539     1531     1528          99.8       99.3
nl    Dutch      1430     1429     1424          99.7       99.6
total            9066              8992                     99.2

Estimation for Europarl Dataset

                          ldig              langdetect        CLD
code  language     size   correct  rate(%)  correct  rate(%)  correct  rate(%)
bg    Bulgarian    1000         -        -      988     98.8      991     99.1
cs    Czech        1000      1000    100.0      994     99.4      995     99.5
da    Danish       1000       976     97.6      968     96.8      932     93.2
de    German       1000       999     99.9      998     99.8     1000    100.0
el    Greek        1000         -        -     1000    100.0     1000    100.0
en    English      1000       999     99.9      996     99.6     1000    100.0
es    Spanish      1000      1000    100.0      996     99.6      989     98.9
et    Estonian     1000         -        -      996     99.6      998     99.8
fi    Finnish      1000       997     99.7      998     99.8     1000    100.0
fr    French       1000       999     99.9      999     99.9      992     99.2
hu    Hungarian    1000      1000    100.0      999     99.9      999     99.9
it    Italian      1000       999     99.9      999     99.9      996     99.6
lt    Lithuanian   1000         -        -      997     99.7      999     99.9
lv    Latvian      1000         -        -      999     99.9      998     99.8
nl    Dutch        1000      1000    100.0      974     97.4      995     99.5
pl    Polish       1000       998     99.8      999     99.9      997     99.7
pt    Portuguese   1000       995     99.5      996     99.6      989     98.9
ro    Romanian     1000      1000    100.0      999     99.9      998     99.8
sk    Slovak       1000         -        -      988     98.8      990     99.0
sl    Slovene      1000         -        -      976     97.6      963     96.3
sv    Swedish      1000       995     99.5      991     99.1      993     99.3
total             21000     13957     99.7    20850     99.3    20814     99.1

The ldig total covers only the languages supported by ldig.

Conclusions

• Language detector using the maximal substring model
– It detects 19 languages with over 99% accuracy
– Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
– There are still many mistakes (in particular between da and no)
• If metadata is added to the features, the precision will improve further
– How to add and train metadata at low cost is an open question
• We want to shrink the model without loss of precision
– It is too large for applications (> 100MB)

References

• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 26: Short Text Language Detection with Infinity-Gram

Text Categorization with All Substring Features [Okanohara+ 09]

bull Multiclass Logistic Regression using all

substrings as features

ndash Maximal Substring makes the equivalent

model that can be constructed in linear

time

ndash Store features into TRIE fast prediction

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 28

Maximal Substring (1)

bull Define a containment(semi-order)

among non empty substrings

abracadabra

ndash ldquorardquo sub ldquobraldquo hArr all rdquorardquo occur

as the substring of ldquobrardquo

ndash ldquoardquo nsub ldquoraldquo hArr ldquoardquo occur in not only ldquoraldquo

but also ldquocardquo It is strictly defined with also its position in the substring

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 29

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]

ndash Unsupervised normalization like coooollll rarr cool

ndash It canrsquot handle words that are not in the dictionary

bull Case 2 If the same character continues in more than 3 Shrink it to 2

ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of

bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on

bull Acronym (like WWW СССР) is not useful for language detection

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 53

Laugh Normalization

bull There are various laughs on each language

ndash HOW MUCH DO YOU LOVE COACH BEISTE

HHAHAHAHAHAH

ndash Hihihihi ) Habe ich regulaumlr 2x die Woche

ndash Tafil con eso Jajajajajajaja

ndash Malo Jejejeje XP

ndash kekeke chỗ đoacute lagravem aacuteo được ko em

bull Shrink them to double

ndash hahahha rArr haha

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 54

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 60

Usage (4) Detect Language

• ldig.py -m [model] [test data]
– Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
– [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
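A record in this format could be read as sketched below. This is only an illustration, not ldig's own loader: the function name parse_record is hypothetical, and it assumes the first tab-separated field is the label, the last is the text, and anything in between (possibly nothing) is metadata.

```python
def parse_record(line):
    """Split one corpus line into (label, metadata fields, text).

    Assumption: fields are TAB-separated, the label comes first, the text
    comes last, and zero or more metadata fields sit in between.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return fields[0], fields[1:-1], fields[-1]


label, meta, text = parse_record("en\tu should just enjoy ur vacation sadly\n")
print(label, meta, text)
```

Taking the text from the last field keeps the sketch indifferent to how many metadata columns a corpus file carries.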

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
– Open http://localhost:[port] after it starts
– For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters

Estimation

language          size  detect  correct  precision  recall  LD53  LDsm
ca Catalan        5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech          7681    7668     7663      99.93   99.77  96.3  99.7
da Danish         5516    5472     5310      97.04   96.27  94.5  92.4
de German        10060   10069    10006      99.37   99.46  86.6  93.8
en English       10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish       10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish        7051    7038     7024      99.80   99.62  98.9  99.6
fr French        10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian      4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian    10178   10225    10160      99.36   99.82  89.7  98.9
it Italian       10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch         10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian      8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish        10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese    10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian       5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish       10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish       10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese    10487   10480    10474      99.94   99.88  98.7  99.2

total: size 166711, correct 165053, accuracy 99.01; LD53 92.2; LDsm 97.4 (precision, recall, accuracy, LD53 and LDsm are percentages)

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the twitter corpus

Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size.
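The slides do not spell out how precision and recall are computed. The sketch below assumes precision = correct / detect and recall = correct / size, which reproduces the Catalan row of the table (size 5093, detect 4923, correct 4857); the function name is hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    Assumption: precision is correct / detect and recall is correct / size.
    """
    return 100.0 * correct / detect, 100.0 * correct / size


# Catalan row of the table above
precision, recall = precision_recall(size=5093, detect=4923, correct=4857)
print(f"precision={precision:.2f} recall={recall:.2f}")
```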

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
– http://www.win.tue.nl/~mpechen/projects/smm
• The 19-language model is used

language          size  detect  correct  precision  recall
de German         1479    1476     1469       99.5    99.3
en English        1505    1502     1490       99.2    99.0
es Spanish        1562    1548     1541       99.6    98.7
fr French         1551    1549     1540       99.4    99.3
it Italian        1539    1531     1528       99.8    99.3
nl Dutch          1430    1429     1424       99.7    99.6

total: size 9066, correct 8992, accuracy 99.2

Estimation for Europarl Dataset

ldig results are given only for the languages it supports ("-" = not supported).

                            ldig          langdetect          CLD
language          size  correct   rate  correct   rate  correct   rate
bg Bulgarian      1000        -      -      988   98.8      991   99.1
cs Czech          1000     1000  100.0      994   99.4      995   99.5
da Danish         1000      976   97.6      968   96.8      932   93.2
de German         1000      999   99.9      998   99.8     1000  100.0
el Greek          1000        -      -     1000  100.0     1000  100.0
en English        1000      999   99.9      996   99.6     1000  100.0
es Spanish        1000     1000  100.0      996   99.6      989   98.9
et Estonian       1000        -      -      996   99.6      998   99.8
fi Finnish        1000      997   99.7      998   99.8     1000  100.0
fr French         1000      999   99.9      999   99.9      992   99.2
hu Hungarian      1000     1000  100.0      999   99.9      999   99.9
it Italian        1000      999   99.9      999   99.9      996   99.6
lt Lithuanian     1000        -      -      997   99.7      999   99.9
lv Latvian        1000        -      -      999   99.9      998   99.8
nl Dutch          1000     1000  100.0      974   97.4      995   99.5
pl Polish         1000      998   99.8      999   99.9      997   99.7
pt Portuguese     1000      995   99.5      996   99.6      989   98.9
ro Romanian       1000     1000  100.0      999   99.9      998   99.8
sk Slovak         1000        -      -      988   98.8      990   99.0
sl Slovene        1000        -      -      976   97.6      963   96.3
sv Swedish        1000      995   99.5      991   99.1      993   99.3
total            21000    13957   99.7    20850   99.3    20814   99.1

Conclusions

• A language detector using the maximal substring model
– Detects 19 languages with over 99% accuracy
– Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, precision will rise further
– There are still many mistakes (in particular between da and no)
• If metadata is added to the features, precision will rise further
– Open question: how to add and train metadata at low cost
• The model should be shrunk without loss of precision
– It is too large for applications (> 100 MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese; original title: 極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


Maximal Substring (1)

• Define a containment relation (a semi-order) among non-empty substrings

abracadabra

– "ra" ⊂ "bra" ⇔ every "ra" occurs as a substring of "bra"
– "a" ⊄ "ra" ⇔ "a" occurs not only in "ra" but also in "ca"

Strictly speaking, the definition also takes the position within the containing substring into account.

Maximal Substring (2)

• Each equivalence class formed by the containment relation has a unique maximal element, which is called a maximal substring
• The maximal substrings of abracadabra are a, abra and abracadabra

via http://d.hatena.ne.jp/nokuno/20120203/1328237067
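The abracadabra example can be checked by brute force. The sketch below is only an illustration under one assumption: a substring is treated as maximal when extending it by one character, on either side, always lowers its occurrence count. The function names are hypothetical, and this enumeration of all substrings is far slower than a suffix-array method.

```python
def count_occurrences(text, sub):
    """Count (possibly overlapping) occurrences of sub in text."""
    return sum(1 for i in range(len(text) - len(sub) + 1) if text.startswith(sub, i))


def maximal_substrings(text):
    """Brute-force maximal substrings of text, shortest first.

    Assumption: s is maximal if every one-character extension of s,
    to the left or to the right, occurs strictly less often than s.
    """
    alphabet = set(text)
    candidates = {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}
    result = []
    for s in candidates:
        n = count_occurrences(text, s)
        if all(count_occurrences(text, c + s) < n and count_occurrences(text, s + c) < n
               for c in alphabet):
            result.append(s)
    return sorted(result, key=lambda s: (len(s), s))


print(maximal_substrings("abracadabra"))
```

For instance, "bra" is not maximal because "abra" occurs just as often (twice), while "abra" itself cannot be extended without dropping to a single occurrence.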

Maximal Substring and Infinity-Gram

• Substrings related by containment always have equal frequencies
• In a model that is a linear combination of features, features with identical values can be merged into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams

Although the equivalence breaks down on the test set, we assume that a sufficiently large training set approximates it.

Extended Suffix Array

• An Extended Suffix Array (ESA) consists of
– SA = Suffix Array
– L = Longest Common Prefixes
– B = Burrows-Wheeler transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT has more than 1 character type
– They can be calculated in linear time
• esaxx: Okanohara's implementation of ESA
– http://code.google.com/p/esaxx

via [Okanohara+ 09]

Corpus and Normalization


Target Languages

• Limit the character types to detect
– In short-text detection, mixed text can be divided by character type
• Latin-alphabet languages
– The most difficult alphabet type to detect
– More than 25 languages have over 5 million speakers

What's the Latin Alphabet?

• Latin alphabet ≠ ASCII alphabet
– å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range          Name                         Supplement
U+0000-007F    Basic Latin                  ASCII
U+0080-00FF    Latin-1 Supplement           Most languages are covered by these
U+0100-017F    Latin Extended-A             (as above)
U+0180-024F    Latin Extended-B             Romanian
U+0250-02AF    IPA Extensions
U+0300-036F    Combining Diacritical Marks  For tone symbol composition
U+1E00-1EFF    Latin Extended Additional    Vietnamese
U+2C60-2C7F    Latin Extended-C             Used by almost no present-day languages
U+A720-A7FF    Latin Extended-D             (as above)

Latin Alphabets in Unicode Codepoint Chart

[Figure: Unicode code point chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
– It samples 1% of all tweets (about 2 million tweets)
– Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
– French tweets account for about 1% of all tweets
– But they account for 50% of the tweets in the Paris timezone alone
– (20% of all tweets have no timezone)
• Annotate tweets with tentative labels using langdetect
– Remove non-French tweets from those labeled 'fr'
– Recover French tweets from those not labeled 'fr'

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]

Created Corpus

• Noiseless tweets as training data
• Noisy tweets with more than 3 words as test data
• Catalan corpus created together with Raúl Velaz and Hiroshi Manabe

language        training    test
ca Catalan          9089    5082
cs Czech            9082    7682
da Danish           7388    5524
de German          44448   10065
en English         44520   10168
es Spanish         44118   10265
fi Finnish          8087    7050
fr French          44339   10098
hu Hungarian       10030    4904
id Indonesian      44722   10181
it Italian         43366   10152
nl Dutch           44682   10007
no Norwegian       10124    8496
pl Polish          16771   10152
pt Portuguese      44215   10208
ro Romanian        10021    5911
sv Swedish         44054   10032
tr Turkish         44703   10308
vi Vietnamese      15030   10488
total             538789  166773

Simple Language Detection

• A language detector can be built from the maximal substring model and the twitter corpus
– On its own it still reaches at most 98% accuracy
• We guess that it is necessary to reduce bias
– data size bias
– language-specific bias
– twitter-specific bias

Bias by Data Size

• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language, up to the size of the largest one
– In practice this is approximated by copying the data a whole number of times and sampling the remainder without replacement

[Chart: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
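The leveling step could be sketched as follows. This is an illustration under stated assumptions, not the author's code: the corpus is taken to be a dict from language code to a non-empty list of texts, and the function name and the seed parameter are hypothetical.

```python
import random


def balance_by_oversampling(corpus, seed=0):
    """Grow every language to the size of the largest one.

    Assumption: corpus maps a language code to a non-empty list of texts.
    Each list is copied a whole number of times and the remainder is
    sampled without replacement, as the slide describes.
    """
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus.values())
    balanced = {}
    for lang, texts in corpus.items():
        copies, remainder = divmod(target, len(texts))
        balanced[lang] = texts * copies + rng.sample(texts, remainder)
    return balanced


toy = {"en": ["a", "b", "c", "d", "e"], "da": ["x", "y"]}
print({lang: len(texts) for lang, texts in balance_by_oversampling(toy).items()})
```

Copying whole multiples first guarantees that every original tweet is seen at least as often as plain sampling with replacement would make likely, while keeping the result the same size for every language.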

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from other languages
• So convert to lower case, excluding 'I'

                        Upper case    Lower case
Turkish, Azerbaijani    I (U+0049)    ı (U+0131)
                        İ (U+0130)    i (U+0069)
Others                  I (U+0049)    i (U+0069)
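A minimal sketch of lowercasing everything except the capital 'I' is shown below. The function name is hypothetical, and every other character simply goes through Python's str.lower(), which is an assumption about how the remaining letters should be handled.

```python
def lower_except_capital_i(text):
    """Lowercase every character except 'I' (U+0049), which is left as is.

    Leaving 'I' untouched avoids deciding between i (U+0069) and
    dotless ı (U+0131) before the language is known.
    """
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))
```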

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
– U+015E-F, U+0162-3: s/t with cedilla
– U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on twitter and on Wikipedia
• The 2 kinds have the same design in some fonts
– Indistinguishable

ș (U+0219)  ş (U+015F)
ț (U+021B)  ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
– 1989: Democratization in Romania
– 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
– 2007: Romania joined the EU
– 2007: Windows Vista supported 's/t with comma' (available for everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
– But they are more popular than the proper ones
– with cedilla : with comma = 2 : 1
– The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA': さ さ
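Mapping the comma-below letters onto their cedilla counterparts can be sketched with a translation table. The table and function names below are hypothetical, and including the upper-case pairs is an assumption based on the code point ranges listed for Romanian.

```python
# Comma-below letters mapped to the cedilla forms that are more common in practice.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def normalize_romanian(text):
    """Replace s/t with comma below by s/t with cedilla."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021bar\u0103"))
```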

Arabic Character Normalization (in language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
– Wikipedia uses ی (U+06CC, Farsi yeh) only
– News uses ي (U+064A, Arabic yeh) only
• U+064A is a substitute in Farsi
– The popular Arabic charset CP-1256 has no character mapped to U+06CC
– As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
– a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
– a ả à ã á ạ
– These tone symbols are also used in general documents such as news
• The tone symbols can be attached to all vowels
– 12 × 6 = 72

Normalization for Vietnamese (2)

• Two representations of vowels with tones
1. Use the precomposed characters U+1EA0 - U+1EF9
  • ẵ = U+1EB5
2. Combine with Combining Diacritical Marks
  • ẵ = U+0103 U+0303
– The two are used about half and half in news and tweets
• Normalize 2 into 1
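One way to turn representation 2 into representation 1 is Unicode NFC composition, sketched below with Python's standard unicodedata module. This swaps in standard NFC normalization for illustration; it is not necessarily how the author's tool does it, and the function name is hypothetical.

```python
import unicodedata


def compose_vietnamese(text):
    """Compose base vowels and combining tone marks into precomposed letters.

    Uses standard Unicode NFC normalization (an assumption made here for
    illustration, not a claim about the original implementation).
    """
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # ă followed by a combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), composed == "\u1eb5")
```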

CJK-Kanji Normalization (1) (in language-detection)

• CJK Kanji has too many characters (more than 20K)
– Other character types have only 30-50 characters
• The character space is very sparse
– Characters that don't occur in the training corpus have no probabilities
  • e.g. 谢谢, Kanji for personal names
– Common frequent characters are too strong
  • e.g. a text which contains "的" tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (in language-detection)

• Group Kanji by frequency and normalize each group to a representative character
– (1) K-means clustering
  • Use tf-idf on Wikipedia and Google News
  • K = 50 (size of the ASCII alphabet = 52)
– (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
  • Simplified Chinese: 现代汉语常用字表 (3500)
  • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
  • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
    – 常用漢字 contains few Kanji for personal names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
– URLs
– mentions
– hash tags
– RT
– emoticons made of letters, like XD or :p
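The removal step could be sketched with a few regular expressions. The patterns below are assumptions chosen for illustration (they are not taken from the author's code), the function name is hypothetical, and emoticon removal is left out because it needs a hand-made list.

```python
import re

# Illustrative patterns only; real tweets need more careful rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hash tags and RT markers, then squeeze spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


print(strip_twitter_noise("RT @shuyo check http://t.co/abc #nlp now"))
```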

Normalization for twitter-Specific Representation

• How should words like 'coooooooollllll' be handled?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
– Unsupervised normalization like coooollll → cool
– It can't handle words that are not in the dictionary
• Case 2: If the same character continues more than 3 times, shrink the run to 2
– No language's orthography has the same Latin letter repeated over 3 times in a row
  • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
  • Acronyms (like WWW, СССР) are not useful for language detection
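Case 2 can be sketched with a single backreference pattern. The sketch reads the threshold as "three or more of the same character", which is an assumption about the slide's wording; the names are hypothetical.

```python
import re

# A character followed by at least two more copies of itself (3+ in a row).
REPEATED = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters down to 2."""
    return REPEATED.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Unlike the dictionary approach of Case 1, this yields "cooll" rather than "cool", but it needs no word list and maps every lengthened variant of a word to the same string.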

Laugh Normalization

• Each language has its own variety of laughs
– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
– Hihihihi :) Habe ich regulär 2x die Woche
– Tafil con eso Jajajajajajaja
– Malo Jejejeje XP
– kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
– hahahha ⇒ haha
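Shrinking laughs could be sketched as below. The pattern is an assumption made for illustration: a laugh is taken to be one of the consonants h, j or k plus a vowel, repeated at least twice as a whole word, with stray doubled letters tolerated. The names are hypothetical and the text is expected to be lowercased already.

```python
import re

# consonant(s) + vowel(s), followed by one or more repeats of the same pair,
# optionally ending in the consonant (e.g. "hahah"); whole words only.
LAUGH = re.compile(r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*\b", re.IGNORECASE)


def shrink_laugh(text):
    """Replace a repeated laugh such as 'hahahha' by two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


print(shrink_laugh("tafil con eso jajajajajajaja"))
print(shrink_laugh("hahahha"))
```

Requiring the same consonant and vowel through backreferences keeps ordinary words that merely alternate such letters from being rewritten.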

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 60

Usage (4) Detect Language

bull ldigpy -m [model] [test data]

ndash Detect languages of test data and output

its result and summary

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 61

Data Format

bull Training and test data

ndash [correct label]yent[meta data]yent[text]

en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

62

Usage (5) Estimation Tool

bull serverpy -m [model] -p [port number]

ndash Open httplocalhost[port] after it is executed

ndash Output their language probabilities contained features and their parameters for a text inputed in the text area

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 63

Estimation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus

As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size

64

language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992

total 166711 165053 9901 922 974

Estimation for LIGA dataset

bull Estimate using LIGA[Tromp+ 11] dataset

with 9066 tweets for 6 languages

ndash httpwwwwintuenl~mpechenprojectssmm

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Use 19 language model

65

Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996

total 9066 8992 992

Estimation for Europarl Dataset

Only supported languages for ldig

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 66

ldig langdetect CLDlanguage size correct rate correct rate correct rate

bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993

total 21000 13957 997 20850 993 20814 991

Conclusions

bull Language detector using maximal substring model

ndash Detect over 99 accuracy for 19 languages

ndash langdetect with tweet corpus even has 97 accuracy

bull If the corpus is maintained the precision will be still up

ndash There are still many mistakes (in particular da and no)

bull If metadata is added to features the precision will be still up

ndash How to add and train metadata at low cost

bull Desire to shrink the model without loss of precision

ndash Too large for application (gt100MB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 67

References

bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定

bull [Okanohara+ 09] Text Categorization with All Substring Features

bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

bull [Cavnar+ 94] N-Gram-Based Text Categorization

bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 68

Page 28: Short Text Language Detection with Infinity-Gram

Maximal Substring (2)

bull Each equivalent class formed by the containment relationship has a unique maximal element that is named Maximal Substring

bull Maximal substrings of abracadabra are a abra and abracadabra

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 30

via httpdhatenanejpnokuno201202031328237067

Maximal Substring and Infinity-Gram

bull Frequencies of substrings that have a containment relationship always equal

bull In the model with linear combination of features it is possible to enclose the common feature values

bull Logistic regression with maximal substrings is equivalent to the one with infinity-grams

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Although the equivalence collapses for test set

we assumes that it can be approximated by a sufficiently large training set

31

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  - Other character types have only 30-50 characters
• The character space is very sparse
  - Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  - Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
    • Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to its representative character
  - (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ascii alphabet = 52)
  - (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      - 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove:
  - URLs
  - mentions
  - hash tags
  - RT
  - emoticons written with letters, like XD and :p
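
The removal rules above can be sketched with regular expressions. The patterns below are illustrative assumptions (they are not copied from ldig), and the name `strip_twitter_noise` is hypothetical; real tweets will need broader patterns.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
# Emoticons built from letters, e.g. XD, xDDD, :p, :D (assumed set, not exhaustive).
FACE_RE = re.compile(r"(?<!\w)(?:[xX]D+|[:;=][pPdD]+)(?!\w)")
SPACES_RE = re.compile(r"\s+")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hash tags, RT markers and letter emoticons."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, FACE_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return SPACES_RE.sub(" ", text).strip()
```

For example, `strip_twitter_noise("RT @foo hello world http://t.co/abc #tag XD")` leaves only the words that carry language information.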

Normalization for twitter-Specific Representation

• How to handle words like 'coooooooollllll'
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  - Unsupervised normalization like coooollll → cool
  - It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  - No language has 3 or more consecutive repetitions of the same Latin letter in its orthography
    • Japanese does: "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
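
Case 2 fits in one regular expression. The sketch below assumes the rule triggers on runs of 3 or more identical characters; the name `shrink_repeats` is hypothetical.

```python
import re

# A character followed by two or more copies of itself.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to exactly 2."""
    return REPEAT_RE.sub(r"\1\1", text)
```

Note that this rule yields "cooll" rather than "cool" for 'coooooooollllll': it only bounds the run length and does not recover the dictionary form as Case 1 would.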

Laugh Normalization

• Each language has its own ways of writing laughter
  - HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  - Hihihihi :) Habe ich regulär 2x die Woche
  - Tafil con eso Jajajajajajaja
  - Malo Jejejeje XP
  - kekeke chỗ đó làm áo được ko em
• Shrink them to a double repetition
  - hahahha ⇒ haha
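
One possible way to shrink laughs is a regular expression that matches a repeated consonant-vowel syllable and tolerates doubled letters, as in the slide's 'hahahha'. The consonant set `[hjk]`, the pattern and the name `shrink_laugh` are all assumptions made for this sketch, and it expects text that has already been lowercased.

```python
import re

# consonant(s) + vowel(s), repeated at least twice, with an optional trailing consonant.
LAUGH_RE = re.compile(r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*")


def shrink_laugh(text):
    """Rewrite laughs such as 'hahahha' or 'jajajaja' as a doubled syllable."""
    return LAUGH_RE.sub(r"\1\2\1\2", text)
```

The leading `\b` keeps the pattern from firing in the middle of ordinary words, but short words that happen to be doubled syllables are still matched (and left unchanged, since they are already doubled).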

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin alphabet languages
  - https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  - ∞-gram LR (maximal substring) [Okanohara+ 09]
  - L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  - Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  - Extract features from the corpus and initialize the model
  - -m: model directory
  - -x: path of the maximal substring extractor (executed as an external process)
  - --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor

• maxsubst [input file] [output file]
  - Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  - Output lines have the form "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  - Train the model on the corpus for 1 cycle of SGD
  - -e: learning rate of SGD
  - -r: coefficient of L1 regularization
  - --wr: number of times to regularize all parameters
    • There are too many parameters to regularize all of them at every step
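
The reason not every parameter is regularized at every step can be seen in a small sketch of L1 SGD with cumulative penalty in the style of [Tsuruoka+ 09]: only the weights touched by the current sample are clipped, using the total penalty `u` accumulated so far and the penalty `q[i]` each weight has already received. This is a generic illustration, not ldig's code; the names `apply_cumulative_penalty` and `sgd_step`, and the use of plain lists and a gradient dict, are assumptions.

```python
def apply_cumulative_penalty(w, q, u, i):
    """Clip w[i] toward zero by the part of the total penalty u it has not yet received."""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    # q[i] records the (signed) penalty actually applied to w[i] so far.
    q[i] += w[i] - z


def sgd_step(w, q, u, grads, eta, c):
    """One SGD step; grads maps feature index -> gradient of the loss for one sample."""
    u += eta * c  # total penalty every weight could have received so far
    for i, g in grads.items():
        w[i] -= eta * g
        apply_cumulative_penalty(w, q, u, i)
    return u
```

Weights of features absent from the sample are left alone and catch up on their owed penalty the next time they are updated, or in an occasional pass over all parameters.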

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  - Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  - Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  - [correct label]\t[meta data]\t[text]

en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
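
A line in this format can be split on TABs. The sketch below assumes that the label is the first field, the text is the last field, and everything in between is metadata (which may be empty); the name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split one data line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # may be [''] when the metadata column is empty
    return label, metadata, text
```

Taking the text from the last field lets the same function read both the short training lines and the longer lines that carry status ID, datetime, user ID and UI language.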

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  - Open http://localhost:[port] after it starts
  - For text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of 'detect' is less than the sum of 'size'. Precision, recall, LD53 and LDsm are percentages.

language        size    detect  correct  precision  recall  LD53  LDsm
ca Catalan      5093    4923    4857     98.66      95.37   95.3  97.0
cs Czech        7681    7668    7663     99.93      99.77   96.3  99.7
da Danish       5516    5472    5310     97.04      96.27   94.5  92.4
de German       10060   10069   10006    99.37      99.46   86.6  93.8
en English      10162   10133   10029    98.97      98.69   88.3  95.0
es Spanish      10244   10284   10120    98.41      98.79   91.5  96.0
fi Finnish      7051    7038    7024     99.80      99.62   98.9  99.6
fr French       10074   10134   10051    99.18      99.77   95.0  98.1
hu Hungarian    4904    4892    4858     99.30      99.06   85.8  95.5
id Indonesian   10178   10225   10160    99.36      99.82   89.7  98.9
it Italian      10143   10205   10103    99.00      99.61   96.2  98.0
nl Dutch        10005   9916    9858     99.42      98.53   69.5  97.4
no Norwegian    8504    8432    8201     97.26      96.44   96.0  96.3
pl Polish       10151   10149   10130    99.81      99.79   98.0  99.7
pt Portuguese   10212   10201   10119    99.20      99.09   88.0  96.9
ro Romanian     5913    5867    5850     99.71      98.93   92.8  97.4
sv Swedish      10025   10093   9942     98.50      99.17   96.0  97.9
tr Turkish      10308   10317   10298    99.82      99.90   97.6  99.5
vi Vietnamese   10487   10480   10474    99.94      99.88   98.7  99.2

total: size 166711, correct 165053 (99.01), LD53 92.2, LDsm 97.4
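
The precision and recall columns follow from the size, detect and correct counts, and the 0.6 threshold explains why some texts are not counted as detected. A minimal sketch, with hypothetical function names `precision_recall` and `decide`:

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to 2 decimals.

    size:    number of test texts whose true label is this language
    detect:  number of texts the detector labeled as this language
    correct: number of texts labeled as this language that really are
    """
    precision = 100.0 * correct / detect
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)


def decide(probabilities, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= threshold else None
```

With the Catalan row, `precision_recall(5093, 4923, 4857)` reproduces the 98.66 and 95.37 shown in the table.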

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets for 6 languages
  - http://www.win.tue.nl/~mpechen/projects/smm
• The 19-language model is used

language     size   detect  correct  precision  recall
de German    1479   1476    1469     99.5       99.3
en English   1505   1502    1490     99.2       99.0
es Spanish   1562   1548    1541     99.6       98.7
fr French    1551   1549    1540     99.4       99.3
it Italian   1539   1531    1528     99.8       99.3
nl Dutch     1430   1429    1424     99.7       99.6

total: size 9066, correct 8992 (99.2)

Estimation for Europarl Dataset

The ldig columns are filled only for the languages ldig supports. Rates are percentages.

                      ldig            langdetect      CLD
language       size   correct  rate   correct  rate   correct  rate
bg Bulgarian   1000   -        -      988      98.8   991      99.1
cs Czech       1000   1000     100.0  994      99.4   995      99.5
da Danish      1000   976      97.6   968      96.8   932      93.2
de German      1000   999      99.9   998      99.8   1000     100.0
el Greek       1000   -        -      1000     100.0  1000     100.0
en English     1000   999      99.9   996      99.6   1000     100.0
es Spanish     1000   1000     100.0  996      99.6   989      98.9
et Estonian    1000   -        -      996      99.6   998      99.8
fi Finnish     1000   997      99.7   998      99.8   1000     100.0
fr French      1000   999      99.9   999      99.9   992      99.2
hu Hungarian   1000   1000     100.0  999      99.9   999      99.9
it Italian     1000   999      99.9   999      99.9   996      99.6
lt Lithuanian  1000   -        -      997      99.7   999      99.9
lv Latvian     1000   -        -      999      99.9   998      99.8
nl Dutch       1000   1000     100.0  974      97.4   995      99.5
pl Polish      1000   998      99.8   999      99.9   997      99.7
pt Portuguese  1000   995      99.5   996      99.6   989      98.9
ro Romanian    1000   1000     100.0  999      99.9   998      99.8
sk Slovak      1000   -        -      988      98.8   990      99.0
sl Slovene     1000   -        -      976      97.6   963      96.3
sv Swedish     1000   995      99.5   991      99.1   993      99.3
total          21000  13957    99.7   20850    99.3   20814    99.1

Conclusions

• A language detector using the maximal substring model
  - It detects 19 languages with over 99% accuracy
  - Even langdetect reaches 97% accuracy with a tweet corpus
• If the corpus is maintained further, the precision will rise
  - There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will rise further
  - How can metadata be added and trained at low cost?
• We would like to shrink the model without loss of precision
  - It is too large for applications (>100MB)

References

• [中谷 (Nakatani) NLP12] Language detection for twitter using maximal substrings (極大部分文字列を使った twitter 言語判定, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


Maximal Substring and Infinity-Gram

• A substring that occurs only inside one longer substring always has the same frequency as that longer substring
• In a model that is a linear combination of features, features whose values are always equal can be factored out into one
• Logistic regression with maximal substrings is therefore equivalent to logistic regression with infinity-grams

Although the equivalence breaks down on the test set, we assume that it holds approximately when the training set is sufficiently large.
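
The factoring-out step can be written as a one-line identity. The symbols are introduced here only for illustration: $x_s(d)$ is the count of substring $s$ in document $d$, $w_s$ is its weight, and $t$ is the maximal substring that contains every occurrence of $s$.

```latex
% Assumption: x_s(d) = x_t(d) for every training document d.
\begin{align}
w_s\, x_s(d) + w_t\, x_t(d) &= (w_s + w_t)\, x_t(d)
\end{align}
```

So a single weight on $t$ can stand in for the weights of all substrings that share its occurrences, which is why the model over maximal substrings loses nothing on the training set.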

Extended Suffix Array

• An Extended Suffix Array (ESA) consists of
  - SA = Suffix Array
  - L = Longest Common Prefixes
  - B = Burrows-Wheeler Transformed text
• A maximal substring that occurs more than once corresponds to an internal node of the suffix tree, which is equivalent to a suffix with L > 0 whose BWT contains more than 1 character type
  - They can be computed in linear time
• esaxx: Okanohara's implementation of ESA
  - http://code.google.com/p/esaxx/

via [Okanohara+ 09]
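
To make the notion of a maximal substring concrete, here is a deliberately naive sketch that counts every substring and keeps those that occur at least twice and lose frequency under any one-character extension. It is quadratic in the text length and is not the linear-time ESA method that esaxx implements; the name `maximal_substrings` is hypothetical.

```python
from collections import Counter


def maximal_substrings(text, min_freq=2):
    """Return {substring: frequency} for maximal substrings of a short text."""
    n = len(text)
    # Count every substring (overlapping occurrences included).
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    alphabet = set(text)
    result = {}
    for s, freq in counts.items():
        if freq < min_freq:
            continue
        # s is not maximal if some one-character extension occurs just as often.
        extendable = any(
            counts.get(c + s, 0) == freq or counts.get(s + c, 0) == freq
            for c in alphabet
        )
        if not extendable:
            result[s] = freq
    return result
```

For "abracadabra" only "a" and "abra" survive: "ab", "br", "abr" and the like always occur inside "abra", so they carry no extra information.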

Corpus and Normalization


Target Languages

• Limit the character types to detect
  - In short text detection, mixed text can be divided by character type
• Latin alphabet languages
  - The most difficult alphabet type to detect
  - There are more than 25 languages with over 5 million speakers

What's Latin Alphabet?

• Latin alphabet ≠ ascii alphabet
  - å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                         Supplement
U+0000-007F   Basic Latin                  ascii
U+0080-00FF   Latin-1 Supplement           Most languages are covered with these
U+0100-017F   Latin Extended-A             Most languages are covered with these
U+0180-024F   Latin Extended-B             Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks  for tone symbol composition
U+1E00-1EFF   Latin Extended Additional    Vietnamese
U+2C60-2C7F   Latin Extended-C             Hardly any present-day languages use these
U+A720-A7FF   Latin Extended-D             Hardly any present-day languages use these
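
The nine ranges in the table can be turned directly into a membership check. This is a sketch only; the names `LATIN_BLOCKS`, `in_latin_blocks` and `is_latin_text` are hypothetical, and note that Basic Latin also contains digits and punctuation.

```python
# The 9 Unicode blocks listed in the table above, as inclusive ranges.
LATIN_BLOCKS = [
    (0x0000, 0x007F), (0x0080, 0x00FF), (0x0100, 0x017F),
    (0x0180, 0x024F), (0x0250, 0x02AF), (0x0300, 0x036F),
    (0x1E00, 0x1EFF), (0x2C60, 0x2C7F), (0xA720, 0xA7FF),
]


def in_latin_blocks(ch):
    """True if the single character ch falls in one of the listed blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN_BLOCKS)


def is_latin_text(text):
    """True if every character of text falls in the listed blocks."""
    return all(in_latin_blocks(ch) for ch in text)
```

A check like this is one way to route a short text to a Latin-alphabet detector before any language model is applied.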

Latin Alphabets in Unicode Codepoint Chart

[Figure: code point chart of Latin letters. Legend: for Vietnamese only / used often / used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the twitter Streaming API
  - It samples 1% of all tweets (about 2 million tweets)
  - Tweets in Latin alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  - French tweets account for only about 1% of all tweets
  - But they account for 50% of the tweets in the Paris timezone
  - (20% of all tweets have no timezone)
• Annotate tentative labels to tweets using langdetect
  - Remove non-French tweets from those labeled 'fr'
  - Recover French tweets from those not labeled 'fr'

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in that order]

Created Corpus

• Noiseless tweets for training data
• Noisy tweets with more than 3 words as test data
• Work with Raúl Velaz and Hiroshi Manabe for Catalan corpus creation

language        training  test
ca Catalan      9089      5082
cs Czech        9082      7682
da Danish       7388      5524
de German       44448     10065
en English      44520     10168
es Spanish      44118     10265
fi Finnish      8087      7050
fr French       44339     10098
hu Hungarian    10030     4904
id Indonesian   44722     10181
it Italian      43366     10152
nl Dutch        44682     10007
no Norwegian    10124     8496
pl Polish       16771     10152
pt Portuguese   44215     10208
ro Romanian     10021     5911
sv Swedish      44054     10032
tr Turkish      44703     10308
vi Vietnamese   15030     10488
total           538789    166773

Simple Language Detection

• A language detector can be constructed from the maximal substring model and a twitter corpus
  - But it reaches at most 98% accuracy
• We guess it is necessary to reduce bias
  - data size bias
  - language-specific bias
  - twitter-specific bias

Bias by Data Size

• The number of tweets differs hugely between languages
• Level them out by sampling with replacement from each language up to the size of the largest data
  - In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Figure: tweet volume by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
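
The approximation described above (whole copies plus a remainder drawn without replacement) might look like the following sketch. The name `level_out` and the optional `rng` parameter are assumptions for illustration.

```python
import random


def level_out(samples, target_size, rng=None):
    """Grow a list of samples to target_size by whole copies plus a random remainder."""
    rng = rng or random.Random()
    copies, remainder = divmod(target_size, len(samples))
    # Every sample appears `copies` times; `remainder` distinct samples appear once more.
    return samples * copies + rng.sample(samples, remainder)
```

Compared with true sampling with replacement, every tweet is guaranteed to appear, and the counts of any two tweets differ by at most one.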

Convert to Lowercase on Multiple Languages

• Conversion into lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case excluding 'I'

                      Upper case   Lower case
Turkish, Azerbaijani  I (U+0049)   ı (U+0131)
                      İ (U+0130)   i (U+0069)
Others                I (U+0049)   i (U+0069)
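
"Lower case excluding 'I'" can be sketched as a per-character conversion that leaves U+0049 untouched. The function name `lower_except_capital_i` is hypothetical.

```python
def lower_except_capital_i(text):
    """Lowercase text but keep 'I' (U+0049), whose lower case is language-dependent."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Keeping 'I' as it is avoids committing to either ı or i before the language is known. Python's own `str.lower` maps İ (U+0130) to 'i' followed by U+0307, so a production version may want to treat that letter separately as well.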

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  - U+015E-F, U+0162-3: s/t with cedilla
  - U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular on news, twitter and Wikipedia
• The 2 code points have the same design in some fonts
  - Indistinguishable

ș ş (U+0219, U+015F)
ț ţ (U+021B, U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography stipulates that 's/t with comma' must be used, they were not available on PCs until recently
  - 1989: Democratization of Romania
  - 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  - 2007: Romania joined the EU
  - 2007: Windows Vista supported 's/t with comma' (available for everyone)

[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  - But they are more popular than the proper ones
  - with cedilla : with comma = 2 : 1
  - "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'

ț → ţ (U+021B → U+0163)

I reckon it is similar to the relationship between the two glyph forms of the Japanese character 'SA': さ さ
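
The rule "regard 's/t with comma' as 's/t with cedilla'" amounts to a four-entry translation table over the code points listed on the earlier slide. A minimal sketch, with hypothetical names `COMMA_TO_CEDILLA` and `normalize_romanian`:

```python
# Comma-below forms (U+0218-B) mapped to the cedilla forms (U+015E-F, U+0162-3).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text):
    """Replace comma-below s/t with the more widely used cedilla forms."""
    return text.translate(COMMA_TO_CEDILLA)
```

Mapping toward the cedilla forms follows the slide's choice: they are the substitutes, but they are about twice as common in real text.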


Page 30: Short Text Language Detection with Infinity-Gram

Extended Suffix Array

bull Extended Suffix Array consists of

ndash SA=Suffix Array

ndash L=Longest Common Prefixes

ndash B=Burrows-Wheelers Transformed text

bull A maximal substring that occurs more than once corresponds to a internal node of Suffix Tree which is equivalent to a suffix with Lgt0 and BWT has more than 1 character type

ndash They can be calculated on linear time

bull esaxx Okanoharas implement of ESA

ndash httpcodegooglecompesaxx

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 32

via [Okanohara+ 09]

Corpus and Normalization

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 33

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Romanian Character Affairs on PC

• Although Romanian orthography prescribes ‘s/t with comma’, they were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: ‘s/t with comma’ were provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)

[Photo: ‘s/t with cedilla’ used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters
  – But they are more common than the proper ones
  – with cedilla : with comma = 2 : 1
  – The “Romanian IME” outputs the substitutes too :D
• Regard ‘s/t with comma’ as ‘s/t with cedilla’
  – ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship of the Japanese character ‘SA’: さ さ
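
The substitution above amounts to a four-entry character mapping. A minimal sketch, with `normalize_romanian` as a hypothetical name chosen for this example:

```python
# Map 's/t with comma below' onto the more common 's/t with cedilla'.
ROMANIAN_SUBSTITUTES = str.maketrans({
    "\u0218": "\u015e",  # Ș → Ş
    "\u0219": "\u015f",  # ș → ş
    "\u021a": "\u0162",  # Ț → Ţ
    "\u021b": "\u0163",  # ț → ţ
})


def normalize_romanian(text):
    """Replace comma-below s/t with their cedilla counterparts."""
    return text.translate(ROMANIAN_SUBSTITUTES)
```

Normalizing toward the cedilla forms matches the direction chosen on the slide: the model then sees a single code point per letter, whichever variant the author typed.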

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character ‘yeh’ in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As ‘yeh’ is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
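
The same idea for ‘yeh’ is a one-entry mapping. The name `normalize_yeh` is hypothetical; the point of the sketch is that Wikipedia-style and news-style spellings of the same word become identical after normalization.

```python
# Farsi yeh (U+06CC) is regarded as Arabic yeh (U+064A).
YEH_MAP = str.maketrans({"\u06cc": "\u064a"})


def normalize_yeh(text):
    """Unify the two code points used for 'yeh'."""
    return text.translate(YEH_MAP)


wikipedia_style = "\u0641\u0627\u0631\u0633\u06cc"  # written with U+06CC
news_style = "\u0641\u0627\u0631\u0633\u064a"       # written with U+064A
same_after_normalization = normalize_yeh(wikipedia_style) == normalize_yeh(news_style)
```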

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in general documents such as news
• The tone marks can be attached to every vowel
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representations of vowels with tones
  1. Use U+1ea0 – U+1ef9
     • ẵ = U+1eb5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  – Half and half in news and tweets
• Normalize 2 into 1
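
One way to normalize representation 2 into representation 1 is Unicode canonical composition (NFC). The slides do not say how the conversion is implemented, so using `unicodedata.normalize` here is an assumption of this sketch, as is the name `normalize_vietnamese`.

```python
import unicodedata


def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # ă followed by a combining tilde
precomposed = normalize_vietnamese(combined)
```

For the example on the slide, the two code points U+0103 U+0303 become the single code point U+1EB5 (ẵ).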

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other character types have only 30–50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text containing “的” tends to be detected as Traditional Chinese
• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to its representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of the ascii alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 the first standard (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS the first standard (2965) = 2998
      – 常用漢字 doesn’t have many Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URLs
  – mentions
  – hash tags
  – RT
  – face marks made of letters, like XD and :p
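
The removal list above can be sketched with a few regular expressions. The patterns below are illustrative assumptions, not the ones ldig uses; in particular the face-mark pattern only covers marks such as xD, XD, :p and :D.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")
# Face marks made of letters, only when they stand alone as a token.
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;=][pPdD]+)(?!\w)")


def strip_twitter_noise(text):
    """Remove URLs, mentions, hash tags, RT and simple face marks."""
    for pattern in (URL, MENTION, HASHTAG, RT, FACE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

URLs are removed first so that the later patterns never see fragments of a link.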

Normalization for twitter-Specific Representation

• How to handle words like ‘coooooooollllll’
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No Latin alphabet language has 3 or more consecutive identical letters in its orthography
    • In Japanese, by contrast, there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
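
Case 2 fits in a single regular expression with a back-reference. A minimal sketch, with `shrink_repeats` as a hypothetical name:

```python
import re

# A character followed by two or more copies of itself.
REPEAT = re.compile(r"(.)\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical characters to 2."""
    return REPEAT.sub(r"\1\1", text)
```

Unlike the dictionary approach of Case 1, this does not recover ‘cool’ from ‘coooollll’ (it yields ‘cooll’), but every lengthened variant of a word collapses to the same string, which is what a substring-based model needs.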

Laugh Normalization

• There are various laughs in each language
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to double
  – hahahha ⇒ haha
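
A laugh is a consonant-vowel syllable repeated with the same consonant and the same vowel, possibly with stray doubled letters as in ‘hahahha’. The sketch below is one possible pattern, assuming the consonants h, j, k and plain vowels; it is not taken from ldig.

```python
import re

# Same consonant (\1) and same vowel (\2) repeated at least twice,
# tolerating doubled letters such as 'hha' or 'haa'.
LAUGH = re.compile(r"([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*", re.IGNORECASE)


def shrink_laugh(text):
    """Shrink a laugh such as 'hahahha' or 'kekeke' to two syllables."""
    return LAUGH.sub(r"\1\2\1\2", text)
```

Requiring the same consonant and vowel throughout keeps ordinary words such as the Portuguese ‘hoje’ untouched. The replacement reuses the first captured letters, so it is best applied after lower-casing.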

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for the Latin alphabet
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • Replace TABs with “ ” and line feeds with U+0001 in it
  – Output is “[feature]\t[frequency]”
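
To make the output format and the --ff lower limit concrete, here is a hedged sketch of reading “[feature]\t[frequency]” lines and dropping rare features. `load_features` is a hypothetical helper, not part of ldig, and it assumes every line has exactly the format above.

```python
def load_features(lines, min_freq):
    """Read '[feature]\\t[frequency]' lines, keeping features seen at least min_freq times."""
    features = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last TAB; features themselves contain no TABs
        # because TABs in the input were replaced with spaces.
        feature, _, freq = line.rpartition("\t")
        if int(freq) >= min_freq:
            features[feature] = int(freq)
    return features
```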

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: number of times to regularize the whole set of parameters
    • There are too many parameters to regularize all of them at every step
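
The cumulative penalty of [Tsuruoka+ 09] is what makes it possible to skip most parameters at each step. The sketch below follows the update rule as I understand it from that paper: `u` is the total L1 penalty every weight could have received so far, and `q[i]` is the penalty weight `i` has actually received. It illustrates the technique only and makes no claim about ldig's code.

```python
def apply_penalty(w, q, i, u):
    """Let weight w[i] catch up with the cumulative L1 penalty u, clipping at zero."""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    # Record how much penalty was actually applied to this weight.
    q[i] += w[i] - z


w = [0.5, -0.5, 0.05]
q = [0.0, 0.0, 0.0]

u = 0.1                     # after step 1: u grows by learning rate * regularizer
apply_penalty(w, q, 0, u)   # only features active in this step are touched
apply_penalty(w, q, 2, u)   # a small weight is clipped to exactly 0

u = 0.2                     # after step 2
apply_penalty(w, q, 0, u)   # receives only the new part of the penalty
apply_penalty(w, q, 1, u)   # untouched so far: receives both steps at once
```

A weight that was skipped simply receives its overdue penalty the next time it is visited, so a pass over all parameters is only needed occasionally.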

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca    [status ID] [datetime] [userID] [language of UI]    xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
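
A line in this format can be split as sketched below. Treating the first field as the label, the last as the text and everything in between as metadata is an assumption based on the examples above; `parse_line` is a hypothetical name.

```python
def parse_line(line):
    """Split '[correct label]\\t[meta data]\\t[text]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    text = fields[-1]
    metadata = fields[1:-1]  # may be a single empty field or several fields
    return label, metadata, text
```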

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it is executed
  – For a text entered in the text area, it outputs the language probabilities, the contained features and their parameters

Evaluation

language        size    detect  correct  precision  recall  LD53   LDsm
ca Catalan      5093    4923    4857     98.66      95.37   95.3   97.0
cs Czech        7681    7668    7663     99.93      99.77   96.3   99.7
da Danish       5516    5472    5310     97.04      96.27   94.5   92.4
de German       10060   10069   10006    99.37      99.46   86.6   93.8
en English      10162   10133   10029    98.97      98.69   88.3   95.0
es Spanish      10244   10284   10120    98.41      98.79   91.5   96.0
fi Finnish      7051    7038    7024     99.80      99.62   98.9   99.6
fr French       10074   10134   10051    99.18      99.77   95.0   98.1
hu Hungarian    4904    4892    4858     99.30      99.06   85.8   95.5
id Indonesian   10178   10225   10160    99.36      99.82   89.7   98.9
it Italian      10143   10205   10103    99.00      99.61   96.2   98.0
nl Dutch        10005   9916    9858     99.42      98.53   69.5   97.4
no Norwegian    8504    8432    8201     97.26      96.44   96.0   96.3
pl Polish       10151   10149   10130    99.81      99.79   98.0   99.7
pt Portuguese   10212   10201   10119    99.20      99.09   88.0   96.9
ro Romanian     5913    5867    5850     99.71      98.93   92.8   97.4
sv Swedish      10025   10093   9942     98.50      99.17   96.0   97.9
tr Turkish      10308   10317   10298    99.82      99.90   97.6   99.5
vi Vietnamese   10487   10480   10474    99.94      99.88   98.7   99.2
total           166711  -       165053   -          99.01   92.2   97.4

Precision, recall, LD53 and LDsm are percentages.
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus.
As a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size.
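
The precision and recall columns follow from the three counts, and the 0.6 threshold explains why detect can fall short of size. The sketch below recomputes the Catalan row; the function names and the dictionary of probabilities are hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    size: tweets whose true label is the language
    detect: tweets the detector assigned to the language
    correct: tweets counted in both
    """
    return 100.0 * correct / detect, 100.0 * correct / size


def decide(probabilities, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None


ca_precision, ca_recall = precision_recall(size=5093, detect=4923, correct=4857)
```

A tweet scored 0.55 Danish and 0.45 Norwegian is left undetected, which lowers recall for its true language without affecting the precision of either.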

Evaluation on the LIGA Dataset

• Evaluate on the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

language     size   detect  correct  precision  recall
de German    1479   1476    1469     99.5       99.3
en English   1505   1502    1490     99.2       99.0
es Spanish   1562   1548    1541     99.6       98.7
fr French    1551   1549    1540     99.4       99.3
it Italian   1539   1531    1528     99.8       99.3
nl Dutch     1430   1429    1424     99.7       99.6
total        9066   -       8992     -          99.2

Evaluation on the Europarl Dataset

ldig is evaluated only on the languages it supports. Rates are percentages.

                        ldig            langdetect      CLD
language        size    correct  rate   correct  rate   correct  rate
bg Bulgarian    1000    -        -      988      98.8   991      99.1
cs Czech        1000    1000     100.0  994      99.4   995      99.5
da Danish       1000    976      97.6   968      96.8   932      93.2
de German       1000    999      99.9   998      99.8   1000     100.0
el Greek        1000    -        -      1000     100.0  1000     100.0
en English      1000    999      99.9   996      99.6   1000     100.0
es Spanish      1000    1000     100.0  996      99.6   989      98.9
et Estonian     1000    -        -      996      99.6   998      99.8
fi Finnish      1000    997      99.7   998      99.8   1000     100.0
fr French       1000    999      99.9   999      99.9   992      99.2
hu Hungarian    1000    1000     100.0  999      99.9   999      99.9
it Italian      1000    999      99.9   999      99.9   996      99.6
lt Lithuanian   1000    -        -      997      99.7   999      99.9
lv Latvian      1000    -        -      999      99.9   998      99.8
nl Dutch        1000    1000     100.0  974      97.4   995      99.5
pl Polish       1000    998      99.8   999      99.9   997      99.7
pt Portuguese   1000    995      99.5   996      99.6   989      98.9
ro Romanian     1000    1000     100.0  999      99.9   998      99.8
sk Slovak       1000    -        -      988      98.8   990      99.0
sl Slovene      1000    -        -      976      97.6   963      96.3
sv Swedish      1000    995      99.5   991      99.1   993      99.3
total           21000   13957    99.7   20850    99.3   20814    99.1

Conclusions

• Language detector using the maximal substring model
  – Detects with over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, the precision will improve further
  – How to add and train metadata at low cost
• We want to shrink the model without loss of precision
  – Too large for applications (>100MB)

References

• [Nakatani NLP12] Twitter language detection using maximal substrings (極大部分文字列を使った twitter 言語判定)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 32: Short Text Language Detection with Infinity-Gram

Target Languages

bull Limit character type to detect

ndash In short text detection mixed text can be

divided to type of characters

bull Latin alphabet language

ndash The most difficult alphabet type to detect

ndash Languages which speakers are over 5

million are more than 25

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 34

Whats Latin Alphabet

bull Latin alphabet ne ascii alphabet

ndash aring ą aelig eth Ħ ŋ and so on

bull They are assigned to 9 code blocks in Unicode

Range Name Supplement

U+0000-007F Basic Latin ascii

U+0080-00FF Latin-1 Supplement Most languages are covered with these U+0100-017F Latin Extended-A

U+0180-024F Latin Extended-B Rumanian

U+0250-02AF IPA Extensions

U+0300-036F Combining Diacritical Marks for tone symbol composition

U+1E00-1EFF Latin Extended Additional Vietnamese

U+2C60-2C7F Latin Extended-C These arenrsquot used by almost all present languages U+A720-A7FF Latin Extended-D

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 35

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Romanian
• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• 's/t with cedilla' is more popular in news, on Twitter, and on Wikipedia
• The 2 kinds look the same in some fonts
  – Indistinguishable

comma below   cedilla
ș (U+0219)    ş (U+015F)
ț (U+021B)    ţ (U+0163)

Romanian Characters on PCs
• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  – 1989: Democratization of Romania
  – 2001: 's/t with comma' provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported 's/t with comma' (available to everyone)
[Photo: 's/t with cedilla' used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  – But they are more popular than the correct ones
  – with cedilla : with comma = 2 : 1
  – The "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  – ț (U+021B) → ţ (U+0163)
I reckon this is similar to the relationship between the two glyph forms of the Japanese character 'SA': さ さ

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  – Wikipedia uses only ی (U+06CC, Farsi yeh)
  – News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  – The popular Arabic charset CP-1256 has no character mapped to U+06CC
  – As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
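Both substitute-character rules (Romanian 's/t with comma' regarded as 's/t with cedilla', and Farsi yeh regarded as Arabic yeh) amount to a fixed code point mapping. A minimal sketch, assuming a plain translation table is enough; the names `SUBSTITUTES` and `normalize_substitutes` are hypothetical.

```python
# Map each character to the form the slides choose as the normalized one.
SUBSTITUTES = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
    "\u06cc": "\u064a",  # Farsi yeh -> Arabic yeh
})

def normalize_substitutes(text):
    return text.translate(SUBSTITUTES)
```

The direction of the mapping follows the slides: the more frequent form in real text is kept, even where it is not the orthographically correct one.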

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  – a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  – a ả à ã á ạ
  – These tone marks are also used in general documents such as news
• The tone marks can be attached to every vowel
  – 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use the precomposed characters U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  – The two are used about half and half in news and tweets
• Normalize 2 into 1
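Normalizing representation 2 into representation 1 can be sketched with Unicode NFC composition from Python's standard library. Using NFC here is an assumption; the slides do not say how ldig performs the conversion, and NFC also composes characters outside the U+1EA0 - U+1EF9 range.

```python
import unicodedata

def compose_vietnamese(text):
    # NFC merges a base letter and its combining marks into one precomposed
    # code point whenever Unicode defines one.
    return unicodedata.normalize("NFC", text)
```

For the example on the slide, `compose_vietnamese("\u0103\u0303")` yields the single code point U+1EB5.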

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
  – Other scripts have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus get no probability
     • e.g. 谢谢, Kanji for personal names
  – Common frequent characters are too strong
     • e.g. a text containing "的" tends to be detected as Traditional Chinese
• Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
     • Use tf-idf on Wikipedia and Google News
     • K=50 (size of the ASCII alphabet = 52)
  – (2) "Commonly used Kanji" lists defined in Japan and China
     • Simplified Chinese: 现代汉语常用字表 (3500)
     • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
     • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
        – 常用漢字 contains few Kanji for personal and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter
• Simply remove:
  – URLs
  – mentions
  – hashtags
  – RT
  – emoticons made of letters, like XD and :p
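The removal step can be sketched with a few regular expressions. The patterns below are assumptions for illustration (the slides do not give the exact rules ldig uses), and they cover only the two emoticon shapes named on the slide.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")
# Letter-based emoticons standing alone as a token: xD, XDDD, :p, :P
FACE = re.compile(r"(?<!\S)(?:[xX]D+|:[pP])(?!\S)")

def strip_twitter_noise(text):
    # URLs go first so that '#' or '@' inside a URL is not treated separately.
    for pattern in (URL, MENTION, HASHTAG, RT, FACE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())
```

For example, `strip_twitter_noise("RT @user so cool xDD http://t.co/abc #fun :p")` leaves only the words that carry language information.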

Normalization for Twitter-Specific Representation
• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  – No language's orthography has the same Latin letter 3 or more times in a row
     • In Japanese, however, there are "かたたたき", "かわいいいぬ", "あわててて" and so on
     • Acronyms (like WWW, СССР) are not useful for language detection
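Case 2 fits in a single regular expression. A minimal sketch; the function name is hypothetical, and "3 or more" is taken as the threshold.

```python
import re

# A character followed by at least two more copies of itself.
REPEATS = re.compile(r"(.)\1{2,}")

def shrink_repeats(text):
    # Replace any run of 3 or more identical characters with exactly 2.
    return REPEATS.sub(r"\1\1", text)
```

Note that the result for 'coooooooollllll' is 'cooll', not 'cool': the rule does not restore the dictionary word, it only maps all lengthened variants to one form.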

Laugh Normalization

• Each language has its own kinds of laughs
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to a double
  – hahahha ⇒ haha
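A rough sketch of shrinking laughs to a double, assuming the text has already been lowercased and that a laugh is a two-letter unit repeated 3 or more times. The pattern is an illustration only: it handles regular laughs such as 'jajajaja' but not irregular ones like 'hahahha' on the slide, which would need a looser rule.

```python
import re

# A two-letter unit followed by at least two more copies of itself.
LAUGH = re.compile(r"(\w\w)\1{2,}")

def shrink_laugh(text):
    # Keep exactly two copies of the repeated unit.
    return LAUGH.sub(r"\1\1", text)
```

Requiring 3 or more repetitions leaves ordinary words with a doubled unit untouched.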

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages
  – https://github.com/shuyo/ldig
     • MIT license
     • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (cumulative penalty) [Tsuruoka+ 09]
  – Double array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extracts features from the corpus and initializes the model
  – -m: model directory
  – -x: path to the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is below the specified value

Maximal Substring Extractor
• maxsubst [input file] [output file]
  – Input: multi-line text
     • Replace TABs with " " and line feeds with U+0001 in it
  – Output: "[feature]\t[frequency]"

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Trains the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of the L1 regularization
  – --wr: how many times to regularize all the parameters
     • There are too many parameters to regularize all of them at every step
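The learning step is described as SGD with L1 regularization using the cumulative penalty of [Tsuruoka+ 09]. The sketch below shows that update rule for a binary logistic regression with binary features. It is not ldig's code: ldig is multi-class, and the function name, data layout, and hyperparameter values here are assumptions.

```python
import math

def train_l1_sgd(samples, n_features, eta=0.1, c=0.01, epochs=1):
    """samples: list of (active feature indices, label in {0, 1})."""
    w = [0.0] * n_features
    q = [0.0] * n_features  # L1 penalty each weight has actually received
    u = 0.0                 # L1 penalty each weight could have received so far
    for _ in range(epochs):
        for feats, y in samples:
            z = sum(w[i] for i in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            u += eta * c
            for i in feats:
                w[i] += eta * (y - p)  # gradient step on the log-likelihood
                before = w[i]
                # Apply the penalty this weight has missed, clipping at zero.
                if w[i] > 0:
                    w[i] = max(0.0, w[i] - (u + q[i]))
                elif w[i] < 0:
                    w[i] = min(0.0, w[i] + (u - q[i]))
                q[i] += w[i] - before
    return w
```

Only the weights of features active in the current sample are touched at each step, which is why a separate pass over all parameters (the `--wr` option above) is needed to regularize features that rarely fire.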

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

Examples:
en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca  [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
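A minimal sketch of reading one record in this format. It assumes that the metadata may itself consist of several tab-separated fields (as the 'ca' example suggests), so the label is taken as the first field and the text as the last one. The function name and the sample values in the usage note are hypothetical.

```python
def parse_record(line):
    fields = line.rstrip("\n").split("\t")
    label = fields[0]       # correct language label, e.g. "en"
    text = fields[-1]       # tweet text is the last field
    metadata = fields[1:-1] # everything in between (may be empty strings)
    return label, metadata, text
```

For example, `parse_record("en\t\tim online\n")` returns `("en", [""], "im online")`.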

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it has started
  – For a text entered in the text area, it shows the language probabilities, the features the text contains, and their parameters

Evaluation

LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus

code  language     size    detect  correct  precision (%)  recall (%)  LD53 (%)  LDsm (%)
ca    Catalan       5093    4923    4857    98.66          95.37       95.3      97.0
cs    Czech         7681    7668    7663    99.93          99.77       96.3      99.7
da    Danish        5516    5472    5310    97.04          96.27       94.5      92.4
de    German       10060   10069   10006    99.37          99.46       86.6      93.8
en    English      10162   10133   10029    98.97          98.69       88.3      95.0
es    Spanish      10244   10284   10120    98.41          98.79       91.5      96.0
fi    Finnish       7051    7038    7024    99.80          99.62       98.9      99.6
fr    French       10074   10134   10051    99.18          99.77       95.0      98.1
hu    Hungarian     4904    4892    4858    99.30          99.06       85.8      95.5
id    Indonesian   10178   10225   10160    99.36          99.82       89.7      98.9
it    Italian      10143   10205   10103    99.00          99.61       96.2      98.0
nl    Dutch        10005    9916    9858    99.42          98.53       69.5      97.4
no    Norwegian     8504    8432    8201    97.26          96.44       96.0      96.3
pl    Polish       10151   10149   10130    99.81          99.79       98.0      99.7
pt    Portuguese   10212   10201   10119    99.20          99.09       88.0      96.9
ro    Romanian      5913    5867    5850    99.71          98.93       92.8      97.4
sv    Swedish      10025   10093    9942    98.50          99.17       96.0      97.9
tr    Turkish      10308   10317   10298    99.82          99.90       97.6      99.5
vi    Vietnamese   10487   10480   10474    99.94          99.88       98.7      99.2
      total       166711       -  165053        -          99.01       92.2      97.4

Because a text whose maximum probability is < 0.6 is treated as undetectable, the total of "detect" is less than the total of "size".
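The precision and recall columns appear to be derived from the count columns as correct/detect and correct/size. This reading is an inference from the numbers rather than something the slides state; the sketch below reproduces the Catalan row under that assumption.

```python
def precision_recall(size, detect, correct):
    # precision: share of the texts detected as the language that are correct
    precision = 100.0 * correct / detect
    # recall: share of the texts of the language that were correctly detected
    recall = 100.0 * correct / size
    return round(precision, 2), round(recall, 2)
```

`precision_recall(5093, 4923, 4857)` gives `(98.66, 95.37)`, matching the Catalan row.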

Evaluation on the LIGA Dataset
• Evaluated on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model

code  language  size  detect  correct  precision (%)  recall (%)
de    German    1479   1476    1469    99.5           99.3
en    English   1505   1502    1490    99.2           99.0
es    Spanish   1562   1548    1541    99.6           98.7
fr    French    1551   1549    1540    99.4           99.3
it    Italian   1539   1531    1528    99.8           99.3
nl    Dutch     1430   1429    1424    99.7           99.6
      total     9066      -    8992       -           99.2

Evaluation on the Europarl Dataset

For ldig, only the languages it supports are evaluated.

                           ldig             langdetect       CLD
code  language     size    correct  rate (%)  correct  rate (%)  correct  rate (%)
bg    Bulgarian    1000       -       -        988     98.8       991     99.1
cs    Czech        1000    1000    100.0       994     99.4       995     99.5
da    Danish       1000     976     97.6       968     96.8       932     93.2
de    German       1000     999     99.9       998     99.8      1000    100.0
el    Greek        1000       -       -       1000    100.0      1000    100.0
en    English      1000     999     99.9       996     99.6      1000    100.0
es    Spanish      1000    1000    100.0       996     99.6       989     98.9
et    Estonian     1000       -       -        996     99.6       998     99.8
fi    Finnish      1000     997     99.7       998     99.8      1000    100.0
fr    French       1000     999     99.9       999     99.9       992     99.2
hu    Hungarian    1000    1000    100.0       999     99.9       999     99.9
it    Italian      1000     999     99.9       999     99.9       996     99.6
lt    Lithuanian   1000       -       -        997     99.7       999     99.9
lv    Latvian      1000       -       -        999     99.9       998     99.8
nl    Dutch        1000    1000    100.0       974     97.4       995     99.5
pl    Polish       1000     998     99.8       999     99.9       997     99.7
pt    Portuguese   1000     995     99.5       996     99.6       989     98.9
ro    Romanian     1000    1000    100.0       999     99.9       998     99.8
sk    Slovak       1000       -       -        988     98.8       990     99.0
sl    Slovene      1000       -       -        976     97.6       963     96.3
sv    Swedish      1000     995     99.5       991     99.1       993     99.3
      total       21000   13957     99.7     20850     99.3     20814     99.1

Conclusions

• Language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, the precision will improve further
  – There are still many mistakes (in particular for da and no)
• If metadata is added to the features, the precision will improve further
  – How can metadata be added and trained on at low cost?
• We would like to shrink the model without loss of precision
  – It is too large for applications (>100MB)

References

• [Nakatani NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


What's the Latin Alphabet?
• Latin alphabet ≠ ASCII alphabet
  – å ą æ ð Ħ ŋ and so on
• They are assigned to 9 code blocks in Unicode

Range         Name                          Supplement
U+0000-007F   Basic Latin                   ASCII
U+0080-00FF   Latin-1 Supplement            Most languages are covered by these two blocks
U+0100-017F   Latin Extended-A              (same as above)
U+0180-024F   Latin Extended-B              Romanian
U+0250-02AF   IPA Extensions
U+0300-036F   Combining Diacritical Marks   for composing tone marks
U+1E00-1EFF   Latin Extended Additional     Vietnamese
U+2C60-2C7F   Latin Extended-C              Hardly used by any present-day language
U+A720-A7FF   Latin Extended-D              (same as above)
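The block table above can be turned into a small lookup, for example to check which blocks a corpus actually uses. A minimal sketch; the names `LATIN_BLOCKS` and `latin_block` are hypothetical, and the ranges are copied from the table.

```python
LATIN_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
    (0x2C60, 0x2C7F, "Latin Extended-C"),
    (0xA720, 0xA7FF, "Latin Extended-D"),
]

def latin_block(ch):
    """Return the name of the Latin-related block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in LATIN_BLOCKS:
        if start <= cp <= end:
            return name
    return None
```

For instance, `latin_block("ș")` returns "Latin Extended-B" and `latin_block("ẵ")` returns "Latin Extended Additional", in line with the Romanian and Vietnamese notes in the table.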

Latin Alphabets in Unicode Codepoint Chart

[Chart: Latin letters in the Unicode code point chart; legend: for Vietnamese only, used often, used sometimes]

How to Create Corpus

• Collect tweets with the sample method of the Twitter Streaming API
  – It samples 1% of all tweets (about 2 million tweets)
  – Tweets in Latin-alphabet languages account for 60% of them
• All that remains is to annotate these tweets with language labels

Language Label Annotation

• Group tweets by their timezone
  – French tweets account for about 1% of all tweets
  – But they account for 50% of the tweets in the Paris timezone alone
• Annotate tentative labels to tweets using langdetect
  – Remove non-French tweets from those labeled 'fr'
  – Recover French tweets from those not labeled 'fr'
(20% of all tweets have no timezone)

How to annotate

[Photo: Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian, and Polish guides, in that order]


Page 34: Short Text Language Detection with Infinity-Gram

Latin Alphabets in Unicode Codepoint Chart

for Vietnamese only use often use sometimes

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 36

How to Create Corpus

bull Collect tweets with sample method of

twitter Streaming API

ndash Sampling 1 of all tweets (about 2

million tweets)

ndash Tweets in Latin alphabet language

account for 60 of them

bull The rest is only to annotate language

labels to these tweets

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 37

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Romanian Character Affairs on PC

• Although Romanian orthography requires 's/t with comma below', these characters were not available on PCs until recently
  - 1989: Democratization of Romania
  - 2001: 's/t with comma' provided by ISO 8859-16 (Latin-10) and Unicode
  - 2007: Romania joined the EU
  - 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' used on an advertising board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  - But they are more common than the proper ones
  - with cedilla : with comma = 2 : 1
  - The "Romanian IME" outputs the substitutes too :D
• Treat 's/t with comma' as 's/t with cedilla'

ț (U+021B) → ţ (U+0163)

I reckon this is similar to the relationship between the two glyph variants of the Japanese character 'SA': さ さ

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  - Wikipedia uses only ی (U+06CC, Farsi yeh)
  - News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  - The popular Arabic charset CP-1256 has no character mapped to U+06CC
  - Because 'yeh' is used very often in both languages, almost all Persian text detection fails
• Treat U+06CC as U+064A
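The two substitutions described so far (Romanian comma-below letters to their cedilla forms, and Farsi yeh to Arabic yeh) amount to a fixed character mapping. A minimal sketch, assuming a plain translation table is enough; the names `SUBSTITUTES` and `normalize_substitutes` are made up for this example:

```python
# Map each "proper" character to the substitute that dominates real text,
# so that both spellings end up as the same feature.
SUBSTITUTES = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
    "\u06cc": "\u064a",  # Farsi yeh -> Arabic yeh
})


def normalize_substitutes(text):
    """Fold Romanian and Persian character variants onto a single code point each."""
    return text.translate(SUBSTITUTES)
```

The direction of the mapping follows the slides: the more frequent form is kept, even though it is formally the substitute.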

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  - a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  - a ả à ã á ạ
  - These tone marks are also used in general documents such as news
• The tone marks can be attached to every vowel
  - 12 × 6 = 72

Normalization for Vietnamese (2)

• Two representations of vowels with tones
  1. Use the precomposed characters U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  - News and tweets use the two about half and half
• Normalize 2 into 1
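Turning representation 2 into representation 1 is what Unicode canonical composition (NFC) does for these characters. The sketch below uses Python's standard `unicodedata` module as a stand-in; whether ldig uses NFC or its own table is not stated in the slides, and the function name is hypothetical.

```python
import unicodedata


def compose_vietnamese(text):
    """Fold base letter + combining tone mark into the precomposed code point."""
    return unicodedata.normalize("NFC", text)
```

For instance, `compose_vietnamese("\u0103\u0303")` yields the single character U+1EB5, matching the example on the slide.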

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
  - Other scripts have only 30-50 characters
• The character space is very sparse
  - Characters that don't occur in the training corpus get no probability
    • e.g. 谢谢, Kanji for personal names
  - Common frequent characters are too strong
    • e.g. a text containing "的" tends to be detected as Traditional Chinese
    • Because Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  - (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of the ASCII alphabet = 52)
  - (2) "Commonly used Kanji" lists published for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      - 常用漢字 contains few Kanji for personal and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  - URLs
  - mentions
  - hashtags
  - RT
  - emoticons built from letters, such as XD and :p
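The removal step can be approximated with a few regular expressions. The patterns below are rough guesses written for illustration (the emoticon pattern in particular covers only a handful of shapes) and are not the expressions ldig actually uses; all names are hypothetical.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b")
# Letter-based emoticons such as XD, xDDD, :p, :D (an assumed, incomplete set).
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;=][dDpP]+)(?!\w)")


def strip_twitter_noise(text):
    """Drop URLs, mentions, hashtags, RT markers and letter emoticons, then tidy spaces."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, FACE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```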

Normalization for twitter-Specific Representation

• How should strings like 'coooooooollllll' be handled?
• Option 1: Build a normalization dictionary following [Brody+ 2011]
  - Unsupervised normalization such as coooollll → cool
  - It can't handle words that are not in the dictionary
• Option 2: If the same character is repeated 3 or more times, shrink it to 2
  - No language has 3 or more consecutive occurrences of the same Latin letter in its orthography
    • Japanese, by contrast, has "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
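Option 2 fits in a single regular expression. In this sketch the character class is only a rough stand-in for "Latin letters" (ASCII plus the Latin-1 Supplement and Latin Extended-A/B ranges, which also include × and ÷); the exact range and the function name are assumptions.

```python
import re

# A Latin letter followed by at least two more copies of itself.
REPEATED_LATIN = re.compile(r"([A-Za-z\u00c0-\u024f])\1{2,}")


def shrink_repeats(text):
    """Shrink any run of 3 or more identical Latin letters down to 2."""
    return REPEATED_LATIN.sub(r"\1\1", text)
```

Runs of kana such as たたた are left alone because they fall outside the character class, which matches the remark that such runs are legitimate in Japanese.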

Laugh Normalization

• Laughter is written differently in each language
  - HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  - Hihihihi :) Habe ich regulär 2x die Woche
  - Tafil con eso Jajajajajajaja
  - Malo Jejejeje XP
  - kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  - hahahha ⇒ haha
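One crude way to shrink laughs, including irregular ones like "hahahha", is to look for words of six or more characters built from exactly one laugh consonant and one vowel. This heuristic, the consonant set `hjk`, the length threshold and the function names are all assumptions made for the sketch; the slides do not say how ldig detects laughs, and a rule this simple will misfire on some real words.

```python
VOWELS = set("aeiou")
LAUGH_CONSONANTS = set("hjk")   # assumed from the examples: haha, jaja, jeje, keke


def shrink_laugh(word):
    """Return e.g. 'haha' for 'hahahha'; return other words unchanged."""
    letters = set(word.lower())
    if len(word) >= 6 and len(letters) == 2:
        vowel = letters & VOWELS
        consonant = letters & LAUGH_CONSONANTS
        if len(vowel) == 1 and len(consonant) == 1:
            syllable = next(iter(consonant)) + next(iter(vowel))
            return syllable * 2
    return word


def normalize_laughs(text):
    """Apply shrink_laugh to every whitespace-separated word."""
    return " ".join(shrink_laugh(word) for word in text.split())
```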

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages
  - https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed there
  - ∞-gram LR (maximal substring) [Okanohara+ 09]
  - L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  - Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  - Extracts features from the corpus and initializes the model
  - -m: model directory
  - -x: path to the maximal substring extractor (run as an external process)
  - --ff: ignore features whose frequency is below the given value

Maximal String Extractor

• maxsubst [input file] [output file]
  - Input is multi-line text
    • TABs in it are replaced with " " and line feeds with U+0001
  - Output is "[feature]\t[frequency]"
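The input and output conventions can be mimicked in a few lines. Both helper names are hypothetical, and details the slide leaves open (for example whether a U+0001 also follows the last line) are guesses, so treat this as a reading aid rather than a description of the real tool.

```python
def to_maxsubst_input(lines):
    """Join tweets into one string: TAB -> space, line boundary -> U+0001."""
    return "\u0001".join(line.replace("\t", " ") for line in lines)


def parse_maxsubst_output(line):
    """Split one '[feature]<TAB>[frequency]' output line into (feature, count)."""
    feature, frequency = line.rstrip("\n").rsplit("\t", 1)
    return feature, int(frequency)
```

Replacing TABs on input is what keeps the TAB-separated output unambiguous: a feature can then never contain the separator.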

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  - Trains the model on the corpus with 1 cycle of SGD
  - -e: learning rate of SGD
  - -r: coefficient of L1 regularization
  - --wr: how often to apply regularization to all parameters
    • There are too many parameters to regularize all of them at every step
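The reason regularization need not touch every parameter at every step is the cumulative penalty idea of [Tsuruoka+ 09]: keep a running total `u` of the penalty each weight could have received and a per-weight total `q` of what was actually applied, and catch a weight up only when its feature occurs. The sketch below does this for a binary logistic regression with binary features, which is a simplification; ldig itself is multi-class, and none of these names come from its source.

```python
import math


def train_l1_sgd(samples, eta=0.1, reg=0.01, epochs=1):
    """samples: list of (feature_ids, label) with label in {0, 1}."""
    w = {}     # feature id -> weight
    q = {}     # feature id -> total L1 penalty actually applied so far
    u = 0.0    # total L1 penalty every weight could have received so far
    for _ in range(epochs):
        for feats, label in samples:
            score = sum(w.get(f, 0.0) for f in feats)
            p = 1.0 / (1.0 + math.exp(-score))
            u += eta * reg
            for f in feats:
                # Plain gradient step on the log-likelihood.
                wf = w.get(f, 0.0) + eta * (label - p)
                before = wf
                qf = q.get(f, 0.0)
                # Apply the penalty this weight has missed, clipping at zero.
                if wf > 0:
                    wf = max(0.0, wf - (u + qf))
                elif wf < 0:
                    wf = min(0.0, wf + (u - qf))
                q[f] = qf + (wf - before)
                w[f] = wf
    return w
```

Weights clipped to exactly 0 are the ones a later shrink step can drop from the model.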

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  - Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  - Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  - [correct label]\t[meta data]\t[text]

en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
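A line in this format can be split as below. The sketch assumes that the first TAB-separated field is the label, the last is the text, and everything in between is metadata (possibly several fields, possibly none); that reading is inferred from the two examples, and `parse_data_line` is a hypothetical helper, not part of ldig.

```python
def parse_data_line(line):
    """Return (label, metadata_fields, text) for one TAB-separated data line."""
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]   # empty list when the line is just label + text
    return label, metadata, text
```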

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  - Open http://localhost:[port] after starting it
  - For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of "detect" is less than the sum of "size".

language             size  detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca  Catalan          5093    4923     4857         98.66      95.37     95.3     97.0
cs  Czech            7681    7668     7663         99.93      99.77     96.3     99.7
da  Danish           5516    5472     5310         97.04      96.27     94.5     92.4
de  German          10060   10069    10006         99.37      99.46     86.6     93.8
en  English         10162   10133    10029         98.97      98.69     88.3     95.0
es  Spanish         10244   10284    10120         98.41      98.79     91.5     96.0
fi  Finnish          7051    7038     7024         99.80      99.62     98.9     99.6
fr  French          10074   10134    10051         99.18      99.77     95.0     98.1
hu  Hungarian        4904    4892     4858         99.30      99.06     85.8     95.5
id  Indonesian      10178   10225    10160         99.36      99.82     89.7     98.9
it  Italian         10143   10205    10103         99.00      99.61     96.2     98.0
nl  Dutch           10005    9916     9858         99.42      98.53     69.5     97.4
no  Norwegian        8504    8432     8201         97.26      96.44     96.0     96.3
pl  Polish          10151   10149    10130         99.81      99.79     98.0     99.7
pt  Portuguese      10212   10201    10119         99.20      99.09     88.0     96.9
ro  Romanian         5913    5867     5850         99.71      98.93     92.8     97.4
sv  Swedish         10025   10093     9942         98.50      99.17     96.0     97.9
tr  Turkish         10308   10317    10298         99.82      99.90     97.6     99.5
vi  Vietnamese      10487   10480    10474         99.94      99.88     98.7     99.2

total: size 166711, correct 165053 (99.01%); LD53 92.2%; LDsm 97.4%
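The columns relate to each other as precision = correct / detect and recall = correct / size, which can be checked against the Catalan row. The sketch also shows the 0.6 cut-off mentioned in the note; both function names are invented for the example.

```python
def pick_language(probabilities, threshold=0.6):
    """Return the most probable language, or None when it is below the threshold."""
    lang, best = max(probabilities.items(), key=lambda item: item[1])
    return lang if best >= threshold else None


def precision_recall(size, detect, correct):
    """Percentages as used in the table: correct/detect and correct/size."""
    return 100.0 * correct / detect, 100.0 * correct / size


# Catalan row: size 5093, detect 4923, correct 4857
precision, recall = precision_recall(5093, 4923, 4857)
print(round(precision, 2), round(recall, 2))   # 98.66 95.37
```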

Estimation for LIGA dataset

• Evaluated on the LIGA [Tromp+ 11] dataset, with 9066 tweets in 6 languages
  - http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

language        size  detect  correct  precision(%)  recall(%)
de  German      1479    1476     1469          99.5       99.3
en  English     1505    1502     1490          99.2       99.0
es  Spanish     1562    1548     1541          99.6       98.7
fr  French      1551    1549     1540          99.4       99.3
it  Italian     1539    1531     1528          99.8       99.3
nl  Dutch       1430    1429     1424          99.7       99.6

total: size 9066, correct 8992 (99.2%)

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports ("-" marks a language a detector does not cover).

                            ldig              langdetect        CLD
language          size   correct  rate(%)  correct  rate(%)  correct  rate(%)
bg  Bulgarian     1000         -        -      988     98.8      991     99.1
cs  Czech         1000      1000    100.0      994     99.4      995     99.5
da  Danish        1000       976     97.6      968     96.8      932     93.2
de  German        1000       999     99.9      998     99.8     1000    100.0
el  Greek         1000         -        -     1000    100.0     1000    100.0
en  English       1000       999     99.9      996     99.6     1000    100.0
es  Spanish       1000      1000    100.0      996     99.6      989     98.9
et  Estonian      1000         -        -      996     99.6      998     99.8
fi  Finnish       1000       997     99.7      998     99.8     1000    100.0
fr  French        1000       999     99.9      999     99.9      992     99.2
hu  Hungarian     1000      1000    100.0      999     99.9      999     99.9
it  Italian       1000       999     99.9      999     99.9      996     99.6
lt  Lithuanian    1000         -        -      997     99.7      999     99.9
lv  Latvian       1000         -        -      999     99.9      998     99.8
nl  Dutch         1000      1000    100.0      974     97.4      995     99.5
pl  Polish        1000       998     99.8      999     99.9      997     99.7
pt  Portuguese    1000       995     99.5      996     99.6      989     98.9
ro  Romanian      1000      1000    100.0      999     99.9      998     99.8
sk  Slovak        1000         -        -      988     98.8      990     99.0
sl  Slovene       1000         -        -      976     97.6      963     96.3
sv  Swedish       1000       995     99.5      991     99.1      993     99.3
total            21000     13957     99.7    20850     99.3    20814     99.1

Conclusions

• A language detector built on the maximal substring model
  - Detects 19 languages with over 99% accuracy
  - Even langdetect reaches 97% accuracy when trained on the tweet corpus
• If the corpus is maintained further, precision should improve further
  - There are still many mistakes (in particular da and no)
• If metadata is added to the features, precision should improve further
  - Open question: how to add and train on metadata at low cost
• The model should be shrunk without loss of precision
  - It is too large for applications (> 100MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Nakatani, "Twitter language detection using maximal substrings", in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty



Page 36: Short Text Language Detection with Infinity-Gram

Language Label Annotation

bull Group tweets by their timezone

ndash French tweets account for about 1 of all ones

ndash But they account for 50 of ones in Paris

timezone only

bull Annotate tentative labels to tweets using

langdetect

ndash Remove non-French tweets from ones labeled lsquofrrsquo

ndash Recover French tweets from ones not labeled lsquofrrsquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 38

( 20 of the whole tweets have no timezone)

How to annotate

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 39

Swedish Norwegian Danish Vietnamese Lithuanian

Czech Hungarian Catalan Rumanian and Polish guides in turn

Created Corpus

bull Noiseless tweets for training data

bull Noiseful tweets with more than 3 words as test data

bull Work with Rauacutel Velaz and Hiroshi Manabe for Catalan corpus creation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 40

language training testca Catalan 9089 5082cs Czech 9082 7682da Dannish 7388 5524de German 44448 10065en English 44520 10168es Spanish 44118 10265fi Finnish 8087 7050fr French 44339 10098hu Hungarian 10030 4904id Indonesian 44722 10181it Italian 43366 10152nl Dutch 44682 10007no Norwegian 10124 8496pl Polish 16771 10152pt Portuguese 44215 10208ro Romanian 10021 5911sv Swedish 44054 10032tr Turkish 44703 10308vi Vietnamese 15030 10488

total 538789 166773

Simple Language Detection

bull Language detector can be constructed

from maximal substring model and

twitter corpus

ndash It still gets at most 98 accuracy

bull We guess it is necessary to reduce bias

ndash data size bias

ndash language-specific bias

ndash twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 to U+1EF9
     • ẵ = U+1EB5
  2. Combine with diacritical marks
     • ẵ = U+0103 U+0303
  – Half and half on news and tweets
• Normalize 2 into 1
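
One way to normalize representation 2 into representation 1 is Unicode NFC composition. This is a sketch under that assumption; the slides do not say whether ldig uses NFC or its own table, and `normalize_vietnamese` is a hypothetical name.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Compose base vowel + combining tone mark into one code point (NFC)."""
    return unicodedata.normalize("NFC", text)


combined = "\u0103\u0303"  # a with breve + combining tilde (representation 2)
print(normalize_vietnamese(combined) == "\u1EB5")  # prints True
```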

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  – Other scripts have only 30-50 characters
• The character space is very sparse
  – Characters that don’t occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text which contains “的” tends to be detected as Traditional Chinese
    • Because Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K = 50 (size of ASCII alphabet = 52)
  – (2) “Commonly Used Kanji” lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  – URL
  – mention
  – hash tag
  – RT
  – emoticons made of letters, such as XD and :p
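
The removals listed above could look roughly like this. The regular expressions are illustrative assumptions, not ldig's actual patterns, and `strip_twitter_noise` is a hypothetical name.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+:?")          # also swallows the colon in "RT @user:"
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b:?")
# Stand-alone emoticons written with letters, e.g. XD, xDDD, :p, ;D
EMOTICON = re.compile(r"(?<!\S)(?:[xX]D+|[:;]-?[pPdD)(])(?!\S)")


def strip_twitter_noise(text: str) -> str:
    """Remove URLs, mentions, hash tags, RT markers and letter emoticons."""
    # URLs go first so that '#' or '@' inside a URL is not treated separately.
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    return " ".join(text.split())  # collapse the gaps left behind
```

For example, `strip_twitter_noise("RT @bob: check http://t.co/abc #fun XD ok")` leaves only `"check ok"`.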

Normalization for twitter-Specific Representation

• How to handle expressions like ‘coooooooollllll’?
• Case 1: Make a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can’t handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink the run to 2
  – No language’s orthography repeats the same Latin letter 3 or more times in a row
    • In Japanese, however, there are “かたたたき”, “かわいいいぬ”, “あわてててて” and so on
    • Acronyms (like WWW, СССР) are not useful for language detection anyway
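
Case 2 fits in one regular expression. This sketch assumes the threshold is "3 or more identical letters in a row" and applies to any letter; `squeeze_repeats` is a hypothetical name.

```python
import re

# A letter (no digits, no underscore) followed by 2 or more copies of itself.
REPEAT = re.compile(r"([^\W\d_])\1{2,}")


def squeeze_repeats(text: str) -> str:
    """Shrink any run of 3+ identical letters down to 2."""
    return REPEAT.sub(r"\1\1", text)
```

Here ‘coooooooollllll’ becomes ‘cooll’, which no longer equals ‘cool’ but keeps its substrings. Because the pattern is not limited to Latin letters, it would also change Japanese words such as かたたたき, which is the caveat raised on the slide.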

Laugh Normalization

• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
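
A crude sketch of laugh shrinking, assuming a laugh is a word made of exactly one of h/j/k and one vowel, each used at least twice. Both that heuristic and the name `normalize_laugh` are assumptions for illustration, not the rule ldig implements.

```python
import re

WORD = re.compile(r"[^\W\d_]+")  # runs of letters


def _shrink(match: re.Match) -> str:
    word = match.group(0)
    lower = word.lower()
    letters = set(lower)
    if len(letters) == 2:
        consonants = letters & set("hjk")
        vowels = letters & set("aeiou")
        if len(consonants) == 1 and len(vowels) == 1:
            c, v = consonants.pop(), vowels.pop()
            # Require both letters twice so that e.g. "aaah" is left alone.
            if lower.count(c) >= 2 and lower.count(v) >= 2:
                return (c + v) * 2
    return word


def normalize_laugh(text: str) -> str:
    """Shrink laughs such as 'hahahha' or 'Jejejeje' to two syllables."""
    return WORD.sub(_shrink, text)
```

The result is lowercased (‘HHAHAHAHAHAH’ becomes ‘haha’), which matches the lower-casing done elsewhere in the pipeline.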

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for the Latin alphabet
  – https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed here
  – ∞-gram LR (maximal substring) [Okanohara+ 09]
  – L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  – Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]
  – Extract features from the corpus and initialize the model
  – -m: model directory
  – -x: path of the maximal substring extractor (executed as an external process)
  – --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]
  – Input is multi-line text
    • TABs in it are replaced with “ ” and line feeds with U+0001
  – Output as “[feature]\t[frequency]”

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  – Train the model on the corpus for 1 cycle of SGD
  – -e: learning rate of SGD
  – -r: coefficient of L1 regularization
  – --wr: how often to regularize all parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  – Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data
  – [correct label]\t[meta data]\t[text]

en  u should just enjoy ur vacation sadly
en  D im online but you arent RT that much
en  im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca  [status ID]  [datetime]  [userID]  [language of UI]  xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
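
A line in this format can be split as sketched below. It assumes the label is the first tab-separated field, the text is the last one, and everything in between is metadata (possibly nothing); `parse_line` is a hypothetical helper, not part of ldig.

```python
def parse_line(line: str):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return fields[0], fields[1:-1], fields[-1]
```

For instance, `parse_line("en\tim gettin attacked\n")` gives `("en", [], "im gettin attacked")`.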

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]
  – Open http://localhost:[port] after it starts
  – For text entered in the text area, it outputs the language probabilities, the features it contains, and their parameters

Evaluation

language          size  detect  correct  precision  recall   LD53   LDsm
ca Catalan        5093    4923     4857      98.66   95.37   95.3   97.0
cs Czech          7681    7668     7663      99.93   99.77   96.3   99.7
da Danish         5516    5472     5310      97.04   96.27   94.5   92.4
de German        10060   10069    10006      99.37   99.46   86.6   93.8
en English       10162   10133    10029      98.97   98.69   88.3   95.0
es Spanish       10244   10284    10120      98.41   98.79   91.5   96.0
fi Finnish        7051    7038     7024      99.80   99.62   98.9   99.6
fr French        10074   10134    10051      99.18   99.77   95.0   98.1
hu Hungarian      4904    4892     4858      99.30   99.06   85.8   95.5
id Indonesian    10178   10225    10160      99.36   99.82   89.7   98.9
it Italian       10143   10205    10103      99.00   99.61   96.2   98.0
nl Dutch         10005    9916     9858      99.42   98.53   69.5   97.4
no Norwegian      8504    8432     8201      97.26   96.44   96.0   96.3
pl Polish        10151   10149    10130      99.81   99.79   98.0   99.7
pt Portuguese    10212   10201    10119      99.20   99.09   88.0   96.9
ro Romanian       5913    5867     5850      99.71   98.93   92.8   97.4
sv Swedish       10025   10093     9942      98.50   99.17   96.0   97.9
tr Turkish       10308   10317    10298      99.82   99.90   97.6   99.5
vi Vietnamese    10487   10480    10474      99.94   99.88   98.7   99.2
total           166711           165053              99.01   92.2   97.4

precision, recall, LD53 and LDsm are in %.
LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus.
Because a text whose maximum probability is < 0.6 is treated as undetectable, the total of ‘detect’ is less than the total of ‘size’.
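
The precision and recall columns are consistent with precision = correct / detect and recall = correct / size. A small check on the Catalan row, assuming those definitions (`precision_recall` is a hypothetical helper):

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent."""
    return 100.0 * correct / detect, 100.0 * correct / size


p, r = precision_recall(size=5093, detect=4923, correct=4857)  # ca Catalan
print(f"{p:.2f} {r:.2f}")  # prints 98.66 95.37
```

The total row works the same way: 165053 / 166711 is 99.01%.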

Evaluation on the LIGA Dataset

• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• Uses the 19-language model

language      size  detect  correct  precision  recall
de German     1479    1476     1469       99.5    99.3
en English    1505    1502     1490       99.2    99.0
es Spanish    1562    1548     1541       99.6    98.7
fr French     1551    1549     1540       99.4    99.3
it Italian    1539    1531     1528       99.8    99.3
nl Dutch      1430    1429     1424       99.7    99.6
total         9066             8992               99.2

Evaluation on the Europarl Dataset

ldig is evaluated only on the languages it supports.

                          ldig         langdetect          CLD
language        size  correct  rate  correct  rate  correct  rate
bg Bulgarian    1000                     988  98.8      991  99.1
cs Czech        1000     1000 100.0      994  99.4      995  99.5
da Danish       1000      976  97.6      968  96.8      932  93.2
de German       1000      999  99.9      998  99.8     1000 100.0
el Greek        1000                    1000 100.0     1000 100.0
en English      1000      999  99.9      996  99.6     1000 100.0
es Spanish      1000     1000 100.0      996  99.6      989  98.9
et Estonian     1000                     996  99.6      998  99.8
fi Finnish      1000      997  99.7      998  99.8     1000 100.0
fr French       1000      999  99.9      999  99.9      992  99.2
hu Hungarian    1000     1000 100.0      999  99.9      999  99.9
it Italian      1000      999  99.9      999  99.9      996  99.6
lt Lithuanian   1000                     997  99.7      999  99.9
lv Latvian      1000                     999  99.9      998  99.8
nl Dutch        1000     1000 100.0      974  97.4      995  99.5
pl Polish       1000      998  99.8      999  99.9      997  99.7
pt Portuguese   1000      995  99.5      996  99.6      989  98.9
ro Romanian     1000     1000 100.0      999  99.9      998  99.8
sk Slovak       1000                     988  98.8      990  99.0
sl Slovene      1000                     976  97.6      963  96.3
sv Swedish      1000      995  99.5      991  99.1      993  99.3
total          21000    13957  99.7    20850  99.3    20814  99.1

Conclusions

• Language detector using the maximal substring model
  – Detects 19 languages with over 99% accuracy
  – Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained, precision will improve further
  – There are still many mistakes (in particular da and no)
• If metadata is added to the features, precision will improve further
  – How can metadata be added and trained at low cost?
• The model should be shrunk without loss of precision
  – Too large for applications (> 100MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter Language Detection Using Maximal Substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll: Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 37: Short Text Language Detection with Infinity-Gram

How to annotate

Swedish, Norwegian, Danish, Vietnamese, Lithuanian, Czech, Hungarian, Catalan, Romanian and Polish guides, in turn

Created Corpus

• Noise-free tweets as training data
• Noisy tweets with more than 3 words as test data
• The Catalan corpus was created together with Raúl Velaz and Hiroshi Manabe

language        training    test
ca Catalan          9089    5082
cs Czech            9082    7682
da Danish           7388    5524
de German          44448   10065
en English         44520   10168
es Spanish         44118   10265
fi Finnish          8087    7050
fr French          44339   10098
hu Hungarian       10030    4904
id Indonesian      44722   10181
it Italian         43366   10152
nl Dutch           44682   10007
no Norwegian       10124    8496
pl Polish          16771   10152
pt Portuguese      44215   10208
ro Romanian        10021    5911
sv Swedish         44054   10032
tr Turkish         44703   10308
vi Vietnamese      15030   10488
total             538789  166773

Simple Language Detection

• A language detector can be built from the maximal substring model and the twitter corpus
  – Even so, it reaches at most 98% accuracy
• We suspect that biases need to be reduced
  – data size bias
  – language-specific bias
  – twitter-specific bias

Bias by Data Size

• The number of tweets per language is hugely imbalanced
• Level them out by sampling with replacement from each language up to the size of the largest one
  – In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Chart of tweet volume by language; legend: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
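
The approximation described above can be sketched as follows. The function names `upsample` and `balance` and the dict-of-lists corpus layout are assumptions for illustration.

```python
import random


def upsample(items, target_size, rng=random):
    """Grow `items` to `target_size`: whole copies of the data plus a
    remainder sampled without replacement."""
    if not items:
        raise ValueError("cannot upsample an empty list")
    copies, remainder = divmod(target_size, len(items))
    return list(items) * copies + rng.sample(list(items), remainder)


def balance(corpus_by_lang, rng=random):
    """Bring every language up to the size of the largest one."""
    largest = max(len(tweets) for tweets in corpus_by_lang.values())
    return {lang: upsample(tweets, largest, rng)
            for lang, tweets in corpus_by_lang.items()}
```

Compared with true sampling with replacement, every tweet is guaranteed to appear at least `copies` times, so no training example is dropped by chance.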

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding ‘I’

                       Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
                       İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
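
A minimal sketch of "lower case excluding ‘I’", assuming only U+0049 is left untouched and every other character goes through Python's default lowercasing; `lower_except_capital_i` is a hypothetical name.

```python
def lower_except_capital_i(text: str) -> str:
    """Lowercase every character except 'I' (U+0049)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Leaving ‘I’ as it is avoids deciding between i and ı before the language is known; the detector simply sees ‘I’ as its own symbol.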

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a “beard”
  – U+015E-F, U+0162-3: s/t with cedilla
  – U+0218-B: s/t with comma below
• ‘s/t with cedilla’ is more common on news, twitter and Wikipedia
• The two code points have the same glyph in some fonts
  – Indistinguishable
  – ș (U+0219) / ş (U+015F)
  – ț (U+021B) / ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires ‘s/t with comma’, these characters were not available on PCs until recently
  – 1989: Democratization in Romania
  – 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode
  – 2007: Romania joined the EU
  – 2007: Windows Vista supported ‘s/t with comma’ (available to everyone)



(NAIST Seminar)

LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus

As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size

64

language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992

total 166711 165053 9901 922 974

Estimation for LIGA dataset

bull Estimate using LIGA[Tromp+ 11] dataset

with 9066 tweets for 6 languages

ndash httpwwwwintuenl~mpechenprojectssmm

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Use 19 language model

65

Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996

total 9066 8992 992

Estimation for Europarl Dataset

Only supported languages for ldig

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 66

ldig langdetect CLDlanguage size correct rate correct rate correct rate

bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993

total 21000 13957 997 20850 993 20814 991

Conclusions

bull Language detector using maximal substring model

ndash Detect over 99 accuracy for 19 languages

ndash langdetect with tweet corpus even has 97 accuracy

bull If the corpus is maintained the precision will be still up

ndash There are still many mistakes (in particular da and no)

bull If metadata is added to features the precision will be still up

ndash How to add and train metadata at low cost

bull Desire to shrink the model without loss of precision

ndash Too large for application (gt100MB)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 67

References

bull [中谷 NLP12] 極大部分文字列を使った twitter 言語判定

bull [Okanohara+ 09] Text Categorization with All Substring Features

bull [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

bull [Cavnar+ 94] N-Gram-Based Text Categorization

bull [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 68

Page 39: Short Text Language Detection with Infinity-Gram

Simple Language Detection

• A language detector can be constructed from a maximal substring model and a Twitter corpus

– It still reaches at most 98% accuracy

• We guess it is necessary to reduce bias

– data size bias

– language-specific bias

– Twitter-specific bias

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 41

Bias by Data Size

• The number of tweets per language is heavily skewed

• Level them out by sampling with replacement from each language up to the size of the largest language

– In practice this is approximated by copying the data an integer number of times and sampling the remainder without replacement

[Figure: share of tweets by language: English, Portuguese, Spanish, Indonesian, Dutch, French, German, Turkish, Italian, Swedish, others]
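The levelling step described on this slide can be sketched as follows. This is a minimal illustration, not ldig's own code: the function names `level_out` and `balance_corpus` and the dict-of-lists corpus layout are assumptions made for the example.

```python
import random

def level_out(samples, target_size, rng):
    """Copy the data an integer number of times, then sample the
    remainder without replacement (the approximation on the slide)."""
    if not samples:
        raise ValueError("cannot level out an empty language")
    copies, rest = divmod(target_size, len(samples))
    return samples * copies + rng.sample(samples, rest)

def balance_corpus(corpus, seed=0):
    """corpus: hypothetical dict mapping language code -> list of tweets."""
    rng = random.Random(seed)
    target = max(len(texts) for texts in corpus.values())
    return {lang: level_out(texts, target, rng) for lang, texts in corpus.items()}
```

Every language ends up with as many samples as the largest one, and each original tweet appears either floor(target / n) or floor(target / n) + 1 times.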

Convert to Lowercase on Multiple Languages

• Converting to lower case reduces the corpus needed and compresses the model

• But the lower case of I (U+0049) in Turkish differs from that in other languages

• Convert to lower case, excluding ‘I’

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)
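A minimal sketch of "lower case everything except ‘I’" is shown below. The function name `lower_except_i` is hypothetical, and the slide does not say how İ (U+0130) is treated, so this sketch simply passes every character other than ‘I’ through Python's `str.lower()`.

```python
def lower_except_i(text):
    """Lower-case every character except 'I' (U+0049), whose lower case
    depends on the language (i in most languages, dotless i in Turkish)."""
    return "".join(ch if ch == "I" else ch.lower() for ch in text)
```

Leaving ‘I’ untouched avoids having to know the language before detecting it; the classifier then sees ‘I’ as its own symbol.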

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z

• There are 2 sets of code points for s/t with a “beard”

– U+015E-F, U+0162-3: s/t with cedilla

– U+0218-B: s/t with comma below

• ‘s/t with cedilla’ is more common in news, Twitter and Wikipedia

• The 2 code points have the same glyph design in some fonts

– Indistinguishable

comma below   cedilla
ș (U+0219)    ş (U+015F)
ț (U+021B)    ţ (U+0163)

Romanian Character Affairs on PC

• Although Romanian orthography requires ‘s/t with comma’, these characters were not available on PCs until recently

– 1989: Democratization in Romania

– 2001: ‘s/t with comma’ was provided by ISO 8859-16 (Latin-10) and Unicode

– 2007: Romania joined the EU

– 2007: Windows Vista supported ‘s/t with comma’ (available for everyone)

[Photo: ‘s/t with cedilla’ is used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters

– But they are more popular than the proper ones

– with cedilla : with comma = 2 : 1

– “Romanian IME” outputs the substitutes too :D

• Regard ‘s/t with comma’ as ‘s/t with cedilla’

ț (U+021B) → ţ (U+0163)

I reckon it is similar to the relationship between glyph variants of the Japanese character ‘SA’ (さ)
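Regarding ‘s/t with comma’ as ‘s/t with cedilla’ amounts to a four-entry character mapping. The sketch below uses the code points listed on the earlier slide; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are made up for this example.

```python
# Map the comma-below letters onto the more common cedilla letters.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below -> S with cedilla
    "\u0219": "\u015f",  # s with comma below -> s with cedilla
    "\u021a": "\u0162",  # T with comma below -> T with cedilla
    "\u021b": "\u0163",  # t with comma below -> t with cedilla
})

def normalize_romanian(text):
    """Fold both spellings into the cedilla form before feature extraction."""
    return text.translate(COMMA_TO_CEDILLA)
```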

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem

• The character ‘yeh’ in Farsi corresponds to 2 code points

– Wikipedia uses ی (U+06CC, Farsi yeh) only

– News uses ي (U+064A, Arabic yeh) only

• U+064A is a substitute in Farsi

– The popular Arabic charset CP-1256 has no character mapped to U+06CC

– As ‘yeh’ is used very often in both languages, almost all Persian text detection fails

• Regard U+06CC as U+064A
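The ‘yeh’ rule is a single character replacement. A minimal sketch, with the hypothetical helper name `normalize_yeh`:

```python
FARSI_YEH = "\u06cc"   # used by Persian Wikipedia
ARABIC_YEH = "\u064a"  # used by Persian news and by Arabic text

def normalize_yeh(text):
    """Regard U+06CC as U+064A so both Persian spellings share features."""
    return text.replace(FARSI_YEH, ARABIC_YEH)
```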

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels

– a ă â e ê i y o ô ơ u ư

• Vietnamese has 6 tones

– a ả à ã á ạ

– These tone marks are also used in general documents such as news

• The tone marks can be attached to all vowels

– 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones

1. Use U+1EA0 - U+1EF9

• ẵ = U+1EB5

2. Combine with Diacritical Marks

• ẵ = U+0103 U+0303

– Half and half in news and tweets

• Normalize 2 into 1
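One standard way to normalize representation 2 into representation 1 is Unicode NFC composition. The slides do not say which mechanism ldig uses, so the sketch below swaps in Python's `unicodedata.normalize` as an assumption; the helper name `compose_vietnamese` is hypothetical.

```python
import unicodedata

def compose_vietnamese(text):
    """Compose base letter + combining tone mark into one precomposed
    code point (Unicode normalization form NFC)."""
    return unicodedata.normalize("NFC", text)
```

For example, U+0103 followed by U+0303 becomes the single code point U+1EB5, matching the example on the slide.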

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)

– Other character types have only 30-50 characters

• The character space is very sparse

– Characters that don’t occur in the training corpus have no probabilities

• e.g. 谢谢, Kanji for person names

– Common frequent characters are too strong

• e.g. a text containing “的” tends to be detected as Traditional Chinese

• Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to its representative character

– (1) K-means clustering

• Use tf-idf on Wikipedia and Google News

• K=50 (size of the ASCII alphabet = 52)

– (2) “Commonly Used Kanji” lists provided in Japanese and Chinese

• Simplified Chinese: 现代汉语常用字表 (3500)

• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 level 1 (5401)

• Japanese: 常用漢字 (2136) ∪ JIS level 1 (2965) = 2998

– 常用漢字 contains few Kanji for person names and place names

• Generate 130 clusters from the product of (1) and (2)
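Once the clusters exist, normalizing each group to its representative character is a table lookup. The sketch below assumes the clusters are already given as strings whose first character is the representative; the toy clusters in the usage note are invented for illustration and are not the 130 clusters from the slide.

```python
def build_kanji_table(clusters):
    """clusters: iterable of strings; the first character of each string
    is taken (by assumption) as that cluster's representative."""
    table = {}
    for cluster in clusters:
        representative = cluster[0]
        for ch in cluster:
            table[ord(ch)] = representative
    return table

def normalize_kanji(text, table):
    """Replace every clustered Kanji with its cluster representative;
    characters outside the table are left unchanged."""
    return text.translate(table)
```

With the toy clusters `["的是不", "谢謝"]`, every character of the first group is rewritten to 的 and both forms of "thanks" collapse to 谢.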

Normalization for Twitter

• Simply remove

– URLs

– mentions

– hashtags

– RT

– emoticons written with letters, like XD, :p
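The removals listed above can be approximated with a handful of regular expressions. The patterns below are illustrative assumptions, not the ones ldig ships; in particular the emoticon pattern only covers a few letter-based faces such as XD and :p.

```python
import re

# Illustrative patterns; real tweets need broader coverage.
PATTERNS = [
    re.compile(r"https?://\S+"),                            # URLs
    re.compile(r"@\w+"),                                    # mentions
    re.compile(r"#\w+"),                                    # hashtags
    re.compile(r"\bRT\b"),                                  # retweet marker
    re.compile(r"(?<!\w)(?:[xX]D+|[:;=][pPdD])(?!\w)"),     # XD, :p, :D ...
]

def strip_twitter_noise(text):
    """Remove Twitter-specific tokens, then collapse whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```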

Normalization for Twitter-Specific Representation

• How to handle strings like ‘coooooooollllll’?

• Case 1: Make a normalization dictionary using [Brody+ 2011]

– Unsupervised normalization like coooollll → cool

– It can’t handle words that are not in the dictionary

• Case 2: If the same character repeats 3 or more times, shrink it to 2

– No language has 3 or more repetitions of the same Latin letter in its orthography

• In Japanese, however, there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on

• Acronyms (like WWW, СССР) are not useful for language detection
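Case 2 fits in one regular expression with a backreference. This is a sketch assuming "3 or more repetitions" and a Latin-letter range chosen for the example (ASCII letters plus U+00C0-U+024F); the helper name `shrink_repeats` is hypothetical.

```python
import re

# A Latin letter followed by at least two more copies of itself.
REPEATED_LATIN = re.compile(r"([A-Za-z\u00c0-\u024f])\1{2,}")

def shrink_repeats(text):
    """Shrink any run of 3+ identical Latin letters to exactly 2."""
    return REPEATED_LATIN.sub(r"\1\1", text)
```

Because the character class is restricted to Latin letters, Japanese strings such as “かたたたき” are left untouched.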

Laugh Normalization

• Each language has its own ways of writing laughter

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi ) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha
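Shrinking a laugh to two repetitions can be sketched with a regular expression over consonant-vowel syllables. The consonant set (h, j, k) and vowel set below are assumptions drawn from the examples on the slide, and the sketch expects text that has already been lower-cased.

```python
import re

# One consonant-vowel syllable (letters may be doubled by typos),
# followed by at least one more occurrence of the same syllable.
LAUGH = re.compile(r"([hjk])\1*([aeiou])\2*(?:\1+\2+)+")

def shrink_laugh(text):
    """Reduce a repeated laugh syllable to exactly two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)
```

The `\1+` inside the repetition is what lets an irregular laugh such as "hahahha" still collapse to "haha".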

Implementation and Evaluation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for Latin-alphabet languages

– https://github.com/shuyo/ldig

• MIT license

• The trained model is also distributed here

– ∞-gram LR (maximal substring) [Okanohara+ 09]

– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

– Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]

– Extract features from the corpus and initialize the model

– -m: model directory

– -x: path of the maximal substring extractor (executed as an external process)

– --ff: ignore features whose frequency is less than the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]

– Input is multi-line text

• TABs are replaced with “ ” and line feeds with U+0001

– Output is “[feature]\t[frequency]”

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

– Train the model on the corpus for 1 cycle of SGD

– -e: learning rate of SGD

– -r: coefficient of L1 regularization

– --wr: number of times to regularize all parameters

• There are too many parameters to regularize all of them at every step
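The reason not every parameter is regularized at every step is the cumulative-penalty idea of [Tsuruoka+ 09]: keep a running total of the L1 penalty each weight could have received, and apply the missing part only when a weight is touched. The sketch below is a generic illustration of that technique for a single weight vector, not ldig's actual code; all names (`apply_penalty`, `sgd_l1_step`, `eta`, `c`) are chosen for this example.

```python
def apply_penalty(w, q, i, u):
    """Pull weight i toward zero by the penalty it has not yet received.
    u    : total penalty any weight could have received so far
    q[i] : penalty actually applied to weight i so far (signed)"""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z

def sgd_l1_step(w, q, u, grad, eta, c):
    """One SGD step. grad is a sparse dict: index -> dLoss/dw for the
    features active in the current sample; only those are regularized."""
    u += eta * c                      # penalty budget grows every step
    for i, g in grad.items():
        w[i] -= eta * g               # plain gradient step
        apply_penalty(w, q, i, u)     # catch up on the L1 penalty
    return u
```

Weights of features that never fire are never visited, which keeps a step cheap even when the model has a very large number of substring features.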

Usage (3) Shrink Model

• ldig.py -m [model] --shrink

– Remove ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]

– Detect the languages of the test data and output the results and a summary

Data Format

• Training and test data

– [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
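A reader for this format can be sketched as below, assuming the fields are TAB-separated, the first field is the label, the last field is the text, and anything in between is metadata (which may be absent, as in the ‘en’ examples). The function name `parse_line` is hypothetical.

```python
def parse_line(line):
    """Split one corpus line into (label, metadata fields, text)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return fields[0], fields[1:-1], fields[-1]
```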

Usage (5) Evaluation Tool

• server.py -m [model] -p [port number]

– Open http://localhost:[port]/ after it starts

– For a text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters

Evaluation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

Since a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

language          size  detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca Catalan        5093    4923     4857         98.66      95.37     95.3     97.0
cs Czech          7681    7668     7663         99.93      99.77     96.3     99.7
da Danish         5516    5472     5310         97.04      96.27     94.5     92.4
de German        10060   10069    10006         99.37      99.46     86.6     93.8
en English       10162   10133    10029         98.97      98.69     88.3     95.0
es Spanish       10244   10284    10120         98.41      98.79     91.5     96.0
fi Finnish        7051    7038     7024         99.80      99.62     98.9     99.6
fr French        10074   10134    10051         99.18      99.77     95.0     98.1
hu Hungarian      4904    4892     4858         99.30      99.06     85.8     95.5
id Indonesian    10178   10225    10160         99.36      99.82     89.7     98.9
it Italian       10143   10205    10103         99.00      99.61     96.2     98.0
nl Dutch         10005    9916     9858         99.42      98.53     69.5     97.4
no Norwegian      8504    8432     8201         97.26      96.44     96.0     96.3
pl Polish        10151   10149    10130         99.81      99.79     98.0     99.7
pt Portuguese    10212   10201    10119         99.20      99.09     88.0     96.9
ro Romanian       5913    5867     5850         99.71      98.93     92.8     97.4
sv Swedish       10025   10093     9942         98.50      99.17     96.0     97.9
tr Turkish       10308   10317    10298         99.82      99.90     97.6     99.5
vi Vietnamese    10487   10480    10474         99.94      99.88     98.7     99.2

total: size 166711, correct 165053, accuracy 99.01%, LD53 92.2%, LDsm 97.4%
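The precision and recall columns follow directly from the counts: precision = correct / detect and recall = correct / size. The short sketch below reproduces the Catalan row and the 0.6 rejection rule; the function names are hypothetical.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent from the three counts."""
    return 100.0 * correct / detect, 100.0 * correct / size

def detect_or_none(probs, threshold=0.6):
    """Return the most probable language, or None when its probability
    is below the threshold (the text is treated as undetectable)."""
    lang, p = max(probs.items(), key=lambda kv: kv[1])
    return lang if p >= threshold else None
```

For Catalan, `precision_recall(5093, 4923, 4857)` gives about 98.66 and 95.37, matching the table.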

Evaluation on the LIGA Dataset

• Evaluate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm/

The 19-language model is used

language      size  detect  correct  precision(%)  recall(%)
de German     1479    1476     1469          99.5       99.3
en English    1505    1502     1490          99.2       99.0
es Spanish    1562    1548     1541          99.6       98.7
fr French     1551    1549     1540          99.4       99.3
it Italian    1539    1531     1528          99.8       99.3
nl Dutch      1430    1429     1424          99.7       99.6

total: size 9066, correct 8992, accuracy 99.2%

Evaluation on the Europarl Dataset

ldig results are shown only for the languages it supports

                         ldig             langdetect       CLD
language          size   correct rate(%)  correct rate(%)  correct rate(%)
bg Bulgarian      1000      -       -       988    98.8      991    99.1
cs Czech          1000   1000   100.0       994    99.4      995    99.5
da Danish         1000    976    97.6       968    96.8      932    93.2
de German         1000    999    99.9       998    99.8     1000   100.0
el Greek          1000      -       -      1000   100.0     1000   100.0
en English        1000    999    99.9       996    99.6     1000   100.0
es Spanish        1000   1000   100.0       996    99.6      989    98.9
et Estonian       1000      -       -       996    99.6      998    99.8
fi Finnish        1000    997    99.7       998    99.8     1000   100.0
fr French         1000    999    99.9       999    99.9      992    99.2
hu Hungarian      1000   1000   100.0       999    99.9      999    99.9
it Italian        1000    999    99.9       999    99.9      996    99.6
lt Lithuanian     1000      -       -       997    99.7      999    99.9
lv Latvian        1000      -       -       999    99.9      998    99.8
nl Dutch          1000   1000   100.0       974    97.4      995    99.5
pl Polish         1000    998    99.8       999    99.9      997    99.7
pt Portuguese     1000    995    99.5       996    99.6      989    98.9
ro Romanian       1000   1000   100.0       999    99.9      998    99.8
sk Slovak         1000      -       -       988    98.8      990    99.0
sl Slovene        1000      -       -       976    97.6      963    96.3
sv Swedish        1000    995    99.5       991    99.1      993    99.3

total            21000  13957    99.7     20850    99.3    20814    99.1

Conclusions

• Language detector using a maximal substring model

– Detects 19 languages with over 99% accuracy

– Even langdetect trained on the tweet corpus reaches 97% accuracy

• If the corpus is maintained, the precision will improve further

– There are still many mistakes (in particular da and no)

• If metadata is added to the features, the precision will improve further

– How to add and train metadata at low cost?

• Desire to shrink the model without loss of precision

– Too large for applications (>100MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 40: Short Text Language Detection with Infinity-Gram

Bias by Data Size

bull Tweet size in each language has huge bias

bull Level them out by sampling with replacement from each language up to the largest data

ndash It actually approximates to copy the integer multiple of data and sample the rest without replacement

English

Portuguese

Spanish

Indonesian

Dutch

French

German

Turkish

Italian

Swedish

othersShort Text Language Detection with Infinity-Gram

(NAIST Seminar) 42

Convert to Lowercase on Multiple Languages

bull Conversion into lower case saves corpus and compresses model

bull But the lower case of I (U+0049) in Turkish differs from others

bull Convert to lower case excluding lsquoIrsquo

Upper case Lower case

Turkish

Azerbaijani

I (U+0049) ı (U+0131)

İ (U+0130) i (U+0069)

Others I (U+0049) i (U+0069) Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 43

Normalization for Rumanian

bull Rumanian uses acirc ă icirc ș ț in addition to a-z

bull There are 2 character type as st with a ldquobeardrdquo

ndash U+015E-F U+0162-3 st with cedilla

ndash U+0218-B st with comma below

bull lsquost with cedillarsquo is more popular on news twitter and Wikipedia

bull The 2 code has the same design in some fonts

ndash Indistinguishable

ș ş U+0219 U+015F

ț ţ U+021B U+0163

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

44

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]

ndash Unsupervised normalization like coooollll rarr cool

ndash It canrsquot handle words that are not in the dictionary

bull Case 2 If the same character continues in more than 3 Shrink it to 2

ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of

bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on

bull Acronym (like WWW СССР) is not useful for language detection

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 53

Laugh Normalization

bull There are various laughs on each language

ndash HOW MUCH DO YOU LOVE COACH BEISTE

HHAHAHAHAHAH

ndash Hihihihi ) Habe ich regulaumlr 2x die Woche

ndash Tafil con eso Jajajajajajaja

ndash Malo Jejejeje XP

ndash kekeke chỗ đoacute lagravem aacuteo được ko em

bull Shrink them to double

ndash hahahha rArr haha

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 54

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 60

Usage (4) Detect Language

bull ldigpy -m [model] [test data]

ndash Detect languages of test data and output

its result and summary

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 61

Data Format

bull Training and test data

ndash [correct label]yent[meta data]yent[text]

en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

62

Usage (5) Estimation Tool

bull serverpy -m [model] -p [port number]

ndash Open httplocalhost[port] after it is executed

ndash Output their language probabilities contained features and their parameters for a text inputed in the text area

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 63

Estimation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus

As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size

64

language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992

total 166711 165053 9901 922 974

Estimation for LIGA dataset

bull Estimate using LIGA[Tromp+ 11] dataset

with 9066 tweets for 6 languages

ndash httpwwwwintuenl~mpechenprojectssmm

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Use 19 language model

65


Page 41: Short Text Language Detection with Infinity-Gram

Convert to Lowercase on Multiple Languages

• Converting text to lower case reduces the corpus needed and compresses the model
• But the lower case of I (U+0049) in Turkish differs from that in other languages
• Convert to lower case, excluding 'I'

Language               Upper case    Lower case
Turkish, Azerbaijani   I (U+0049)    ı (U+0131)
Turkish, Azerbaijani   İ (U+0130)    i (U+0069)
Others                 I (U+0049)    i (U+0069)

Short Text Language Detection with Infinity-Gram (NAIST Seminar)
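The rule "convert to lower case, excluding 'I'" can be sketched as follows. This is a minimal illustration, not the code ldig ships; the function name `lower_except_capital_i` is hypothetical, and only U+0049 is excluded, as the slide states.

```python
def lower_except_capital_i(text: str) -> str:
    """Lowercase every character except 'I' (U+0049)."""
    # 'I' is left untouched because its lower case depends on the language:
    # 'ı' (U+0131) in Turkish and Azerbaijani, 'i' (U+0069) elsewhere.
    return "".join(ch if ch == "I" else ch.lower() for ch in text)


print(lower_except_capital_i("ISTANBUL Is Big"))
```

Keeping 'I' as-is means the model sees the same symbol in every language and never has to guess the language before lowercasing.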

Normalization for Romanian

• Romanian uses â ă î ș ț in addition to a-z
• There are 2 kinds of s/t with a "beard"
  - U+015E-F, U+0162-3: s/t with cedilla
  - U+0218-B: s/t with comma below
• 's/t with cedilla' is more common in news, Twitter and Wikipedia
• The 2 kinds have the same design in some fonts
  - Indistinguishable

Letter   With comma below   With cedilla
s        ș (U+0219)         ş (U+015F)
t        ț (U+021B)         ţ (U+0163)

Romanian Character Situation on PCs

• Although Romanian orthography requires 's/t with comma', these characters were not available on PCs until recently
  - 1989: Democratization in Romania
  - 2001: 's/t with comma' was provided by ISO 8859-16 (Latin-10) and Unicode
  - 2007: Romania joined the EU
  - 2007: Windows Vista supported 's/t with comma' (available to everyone)

[Photo: 's/t with cedilla' is used on an advertisement board in Bucharest]

Normalization for Substitute Characters

• 's/t with cedilla' are substitute characters
  - But they are more common than the correct ones
  - with cedilla : with comma = 2 : 1
  - "Romanian IME" outputs the substitutes too :D
• Regard 's/t with comma' as 's/t with cedilla'
  - ț (U+021B) → ţ (U+0163)
• I reckon it is similar to the relationship between the glyph variants of the Japanese character 'SA' (さ)
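Regarding 's/t with comma' as 's/t with cedilla' amounts to a four-entry character mapping. A minimal sketch, assuming both upper and lower case are mapped (the code-point ranges come from the slides; the names `COMMA_TO_CEDILLA` and `normalize_romanian` are hypothetical):

```python
# Map 's/t with comma below' onto the more common 's/t with cedilla'.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # S with comma below -> S with cedilla
    "\u0219": "\u015F",  # s with comma below -> s with cedilla
    "\u021A": "\u0162",  # T with comma below -> T with cedilla
    "\u021B": "\u0163",  # t with comma below -> t with cedilla
})


def normalize_romanian(text: str) -> str:
    """Replace comma-below s/t with their cedilla counterparts."""
    return text.translate(COMMA_TO_CEDILLA)


print(normalize_romanian("\u0219i \u021Bar\u0103"))
```

Mapping toward the cedilla forms follows the 2 : 1 usage ratio above: the training corpus and most input text already use them.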

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem
• The character 'yeh' in Farsi corresponds to 2 code points
  - Wikipedia uses only ی (U+06CC, Farsi yeh)
  - News uses only ي (U+064A, Arabic yeh)
• U+064A is a substitute in Farsi
  - The popular Arabic charset CP-1256 has no character mapped to U+06CC
  - As 'yeh' is used very often in both languages, almost all Persian text detection fails
• Regard U+06CC as U+064A
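The 'yeh' rule is a single substitution. A minimal sketch (the function name `normalize_yeh` is hypothetical; the two code points are the ones named on the slide):

```python
FARSI_YEH = "\u06CC"
ARABIC_YEH = "\u064A"


def normalize_yeh(text: str) -> str:
    """Regard Farsi yeh (U+06CC) as Arabic yeh (U+064A)."""
    return text.replace(FARSI_YEH, ARABIC_YEH)


sample = "\u0641\u0627\u0631\u0633\u06CC"  # a Persian word ending in Farsi yeh
print(normalize_yeh(sample) == "\u0641\u0627\u0631\u0633\u064A")
```

After this step Wikipedia-style and news-style Persian text share one code point for 'yeh', so the remaining characters decide between Arabic and Persian.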

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels
  - a ă â e ê i y o ô ơ u ư
• Vietnamese has 6 tones
  - a ả à ã á ạ
  - These tone symbols are also used in general documents such as news
• The tone symbols can be appended to all vowels
  - 12 × 6 = 72

Normalization for Vietnamese (2)

• Representation of vowels with tones
  1. Use U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. Combine with Diacritical Marks
     • ẵ = U+0103 U+0303
  - Half and half in news and tweets
• Normalize 2 into 1
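One way to normalize representation 2 into representation 1 is Unicode canonical composition (NFC). This is a sketch under that assumption; the slides do not say which mechanism the author used, and NFC also composes non-Vietnamese sequences. The function name `compose_vietnamese` is hypothetical.

```python
import unicodedata


def compose_vietnamese(text: str) -> str:
    """Compose base vowel + combining tone mark into one precomposed code point."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0103\u0303"  # a-breve followed by combining tilde
composed = compose_vietnamese(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))
```

With both spellings folded into one, a substring feature learned from news text also fires on tweets that use combining marks.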

CJK-Kanji Normalization (1) (on language-detection)

• CJK-Kanji has too many characters (more than 20K)
  - Other character types have only 30-50 characters
• The character space is very sparse
  - Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  - Common frequent characters are too strong
    • e.g. a text which has "的" tends to be detected as Traditional Chinese
    • Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  - (1) K-means clustering
    • Use tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  - (2) "Commonly Used Kanji" lists provided for Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      - 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove
  - URLs
  - mentions
  - hashtags
  - RT
  - emoticons made of letters, like XD and :p
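The removal rules above can be sketched with a handful of regular expressions. The patterns below are illustrative assumptions, not the ones ldig uses; in particular the emoticon pattern covers only a few letter-based faces such as XD and :p.

```python
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+:?")
HASHTAG = re.compile(r"#\w+")
RETWEET = re.compile(r"\bRT\b:?")
# Letter-based emoticons standing alone between whitespace (assumed subset).
EMOTICON = re.compile(r"(?<!\S)(?:[:;]-?[)(DPpOo]|[xX]D+)(?!\S)")
SPACES = re.compile(r"\s+")


def strip_twitter_markup(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and simple emoticons."""
    for pattern in (URL, MENTION, HASHTAG, RETWEET, EMOTICON):
        text = pattern.sub(" ", text)
    # Collapse the gaps left by the removals.
    return SPACES.sub(" ", text).strip()


print(strip_twitter_markup("RT @user: check http://example.com #fun XD so cool :p"))
```

URLs are removed first so that a '#' or '@' inside a URL is not mistaken for a hashtag or mention.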

Normalization for twitter-Specific Representation

• How to handle words like 'coooooooollllll'?
• Case 1: Make a normalization dictionary using [Brody+ 11]
  - Unsupervised normalization like coooollll → cool
  - It can't handle words that are not in the dictionary
• Case 2: If the same character repeats 3 or more times, shrink it to 2
  - No language has 3 or more repetitions of the same Latin letter in its orthography
    • Japanese, however, has "かたたたき", "かわいいいぬ", "あわててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
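Case 2 fits in one regular expression. A minimal sketch, assuming "3 or more" repetitions trigger the shrink and that only Latin letters (ASCII plus the Latin-1 and Latin Extended ranges) are affected; the names `REPEATED_LATIN` and `shrink_repeats` are hypothetical.

```python
import re

# A Latin letter followed by at least two more copies of itself.
REPEATED_LATIN = re.compile(r"([A-Za-z\u00C0-\u024F])\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink runs of 3 or more identical Latin letters down to 2."""
    return REPEATED_LATIN.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
print(shrink_repeats("LOOOOOOOOOOOOOOOOL"))
```

Restricting the character class to Latin letters leaves Japanese words such as "かたたたき" untouched, which matches the caveat on the slide.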

Laugh Normalization

• Each language has its own laughs
  - HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  - Hihihihi ) Habe ich regulär 2x die Woche
  - Tafil con eso Jajajajajajaja
  - Malo Jejejeje XP
  - kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  - hahahha ⇒ haha
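Laugh shrinking can be sketched as a pattern for a consonant-vowel pair that repeats, tolerating doubled letters as in "hahahha". The consonant set h, j, k and the vowel set are assumptions read off the examples above, and the input is assumed to be lowercased already; `LAUGH` and `normalize_laugh` are hypothetical names.

```python
import re

# consonant(s) + vowel(s), then the same consonant/vowel pair one or more times
LAUGH = re.compile(r"([hjk])+([aeiou])+(?:\1+\2+)+")


def normalize_laugh(text: str) -> str:
    """Shrink a repeated laugh syllable to exactly two repetitions."""
    return LAUGH.sub(r"\1\2\1\2", text)


for sample in ("hahahha", "tafil con eso jajajajajajaja", "kekeke"):
    print(normalize_laugh(sample))
```

Ordinary words are unaffected because the pattern needs the same consonant and vowel to come back at least once ("hello" has no second "he").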

Implementation and Estimation


Language Detection with Infinity-Gram (ldig)

• Tweet language detection for the Latin alphabet
  - https://github.com/shuyo/ldig
    • MIT license
    • The trained model is also distributed there
  - ∞-gram LR (maximal substring) [Okanohara+ 09]
  - L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]
  - Double Array

Usage (1) Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]
  - Extracts features from the corpus and initializes the model
  - -m: model directory
  - -x: path of the maximal substring extractor (executed as an external process)
  - --ff: ignore features whose frequency is less than the specified value

Maximal String Extractor

• maxsubst [input file] [output file]
  - Input is multi-line text
    • TABs are replaced with " " and line feeds with U+0001
  - Output is "[feature]\t[frequency]"
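The character replacements and the output format can be illustrated as below. This is only a sketch of the text transformation and of reading "[feature]\t[frequency]" lines; the slide does not say whether the extractor or the caller performs the replacement, and both function names are hypothetical.

```python
def prepare_extractor_input(text: str) -> str:
    """Replace TABs with a space and line feeds with U+0001."""
    return text.replace("\t", " ").replace("\n", "\u0001")


def parse_extractor_output(lines):
    """Read '[feature]<TAB>[frequency]' lines into a dict."""
    features = {}
    for line in lines:
        feature, sep, frequency = line.rstrip("\n").rpartition("\t")
        if not sep:
            continue  # skip lines without a TAB separator
        features[feature] = int(frequency)
    return features


print(repr(prepare_extractor_input("first\ttweet\nsecond tweet")))
print(parse_extractor_output(["ha\t12\n", " de \t7\n"]))
```

Removing TABs from the input keeps TAB free to act as the delimiter in the output, and U+0001 marks tweet boundaries with a character that does not occur in normal text.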

Usage (2) Learn

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]
  - Learns the model from the corpus with 1 cycle of SGD
  - -e: learning rate of SGD
  - -r: L1 regularization coefficient
  - --wr: number of times to regularize all the parameters
    • There are too many parameters to regularize all of them at every step

Usage (3) Shrink Model

• ldig.py -m [model] --shrink
  - Removes ineffective features (those whose parameters are all 0) from the model

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  - Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  - [correct label]\t[meta data]\t[text]

en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca    [status ID]    [datetime]    [userID]    [language of UI]    xxx    xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
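A line in this format can be read by splitting on TAB. The sketch below assumes the first field is the label, the last field is the text, and anything in between is metadata (possibly none, as in the "en" examples); `Example` and `parse_line` are hypothetical names, not part of ldig.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    label: str
    metadata: List[str]
    text: str


def parse_line(line: str) -> Example:
    """Split '[label]<TAB>[meta data...]<TAB>[text]' into its parts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return Example(label=fields[0], metadata=fields[1:-1], text=fields[-1])


print(parse_line("en\tu should just enjoy ur vacation sadly\n"))
```

Taking the text from the last field lets the same reader handle lines with and without metadata.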

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  - Open http://localhost:[port] after it is executed
  - Outputs the language probabilities, the contained features and their parameters for the text entered in the text area

Estimation

• LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the twitter corpus
• Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size

language        size    detect  correct  precision (%)  recall (%)  LD53 (%)  LDsm (%)
ca Catalan      5093    4923    4857     98.66          95.37       95.3      97.0
cs Czech        7681    7668    7663     99.93          99.77       96.3      99.7
da Danish       5516    5472    5310     97.04          96.27       94.5      92.4
de German       10060   10069   10006    99.37          99.46       86.6      93.8
en English      10162   10133   10029    98.97          98.69       88.3      95.0
es Spanish      10244   10284   10120    98.41          98.79       91.5      96.0
fi Finnish      7051    7038    7024     99.80          99.62       98.9      99.6
fr French       10074   10134   10051    99.18          99.77       95.0      98.1
hu Hungarian    4904    4892    4858     99.30          99.06       85.8      95.5
id Indonesian   10178   10225   10160    99.36          99.82       89.7      98.9
it Italian      10143   10205   10103    99.00          99.61       96.2      98.0
nl Dutch        10005   9916    9858     99.42          98.53       69.5      97.4
no Norwegian    8504    8432    8201     97.26          96.44       96.0      96.3
pl Polish       10151   10149   10130    99.81          99.79       98.0      99.7
pt Portuguese   10212   10201   10119    99.20          99.09       88.0      96.9
ro Romanian     5913    5867    5850     99.71          98.93       92.8      97.4
sv Swedish      10025   10093   9942     98.50          99.17       96.0      97.9
tr Turkish      10308   10317   10298    99.82          99.90       97.6      99.5
vi Vietnamese   10487   10480   10474    99.94          99.88       98.7      99.2

total: size 166711, correct 165053 (99.01%); LD53 92.2%, LDsm 97.4%
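The precision and recall columns are consistent with the usual definitions, precision = correct / detect and recall = correct / size. The slides do not spell these formulas out, so the sketch below is an assumption that happens to reproduce the Catalan row; the threshold helper mirrors the "maximum probability < 0.6 is undetectable" rule. Both function names are hypothetical.

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent, rounded to 2 decimals."""
    precision = 100.0 * correct / detect  # share of detections that are right
    recall = 100.0 * correct / size       # share of true texts that are found
    return round(precision, 2), round(recall, 2)


def decide(probabilities: dict, threshold: float = 0.6):
    """Return the most probable language, or None if it is below the threshold."""
    label, p = max(probabilities.items(), key=lambda item: item[1])
    return label if p >= threshold else None


print(precision_recall(size=5093, detect=4923, correct=4857))
print(decide({"da": 0.55, "no": 0.45}))
```

Because `decide` can return no language at all, the detect counts need not add up to the corpus size.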

Estimation for LIGA dataset

• Estimate using the LIGA [Tromp+ 11] dataset with 9066 tweets in 6 languages
  - http://www.win.tue.nl/~mpechen/projects/smm
• Uses the 19-language model

language     size   detect  correct  precision (%)  recall (%)
de German    1479   1476    1469     99.5           99.3
en English   1505   1502    1490     99.2           99.0
es Spanish   1562   1548    1541     99.6           98.7
fr French    1551   1549    1540     99.4           99.3
it Italian   1539   1531    1528     99.8           99.3
nl Dutch     1430   1429    1424     99.7           99.6

total: size 9066, correct 8992 (99.2%)

Estimation for Europarl Dataset

• ldig is evaluated only on the languages it supports ("-" marks an unsupported language)

                        ldig               langdetect         CLD
language        size    correct  rate (%)  correct  rate (%)  correct  rate (%)
bg Bulgarian    1000    -        -         988      98.8      991      99.1
cs Czech        1000    1000     100.0     994      99.4      995      99.5
da Danish       1000    976      97.6      968      96.8      932      93.2
de German       1000    999      99.9      998      99.8      1000     100.0
el Greek        1000    -        -         1000     100.0     1000     100.0
en English      1000    999      99.9      996      99.6      1000     100.0
es Spanish      1000    1000     100.0     996      99.6      989      98.9
et Estonian     1000    -        -         996      99.6      998      99.8
fi Finnish      1000    997      99.7      998      99.8      1000     100.0
fr French       1000    999      99.9      999      99.9      992      99.2
hu Hungarian    1000    1000     100.0     999      99.9      999      99.9
it Italian      1000    999      99.9      999      99.9      996      99.6
lt Lithuanian   1000    -        -         997      99.7      999      99.9
lv Latvian      1000    -        -         999      99.9      998      99.8
nl Dutch        1000    1000     100.0     974      97.4      995      99.5
pl Polish       1000    998      99.8      999      99.9      997      99.7
pt Portuguese   1000    995      99.5      996      99.6      989      98.9
ro Romanian     1000    1000     100.0     999      99.9      998      99.8
sk Slovak       1000    -        -         988      98.8      990      99.0
sl Slovene      1000    -        -         976      97.6      963      96.3
sv Swedish      1000    995      99.5      991      99.1      993      99.3

total           21000   13957    99.7      20850    99.3      20814    99.1

Conclusions

• Language detector using a maximal substring model
  - Detects 19 languages with over 99% accuracy
  - Even langdetect reaches 97% accuracy with the tweet corpus
• If the corpus is maintained further, the precision will rise further
  - There are still many mistakes (in particular for da and no)
• If metadata is added to the features, the precision will rise further
  - How to add and train metadata at low cost?
• The model should be shrunk without loss of precision
  - Too large for applications (> 100MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定, in Japanese)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


Page 43: Short Text Language Detection with Infinity-Gram

Rumanian Character Affairs on PC

bull Although Romanian orthography provided that lsquost with commarsquo must be used they was not available to PC until recently

ndash 1989 Democratization in Rumania

ndash 2001 lsquost with commarsquo was provided by ISO8859-16(Latin-10) and Unicode

ndash 2007 Rumania seated in the EU

ndash 2007 Windows Vista supported lsquost with commarsquo (available for everyone)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 45

lsquost with cedillarsquo is used

on an advertisement board

in Bucharest

Normalization for Substitute Characters

bull lsquost with cedillarsquo are substitute characters

ndash But they are more popular than the others

ndash with cedilla with comma = 2 1

ndash ldquoRumanian IMErdquo outputs the substitutes too D

bull Regard lsquost with commarsquo as lsquost with cedillarsquo

ț ţ U+021B U+0163

I reckon it is similar to the relationship of

Japanese character lsquoSArsquo さ さ Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 46

Arabic Character Normalization (on language-detection)

bull Arabic and Persian have the similar trouble too

bull Character lsquoyehrsquo in Farsi corresponds to 2 code points

ndash Wikipedia uses ی (U+06cc Farsi yeh) only

ndash News uses ي(U+064a Arabic yeh) only

bull U+064a is a substitute in Farsi

ndash The popular Arabic charset CP-1256 has no character mapped into U+06cc

ndash As lsquoyehrsquo is very often used in both languages quite all Persian text detection fails

bull Regard U+06cc as U+064a

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 47

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]

ndash Unsupervised normalization like coooollll rarr cool

ndash It canrsquot handle words that are not in the dictionary

bull Case 2 If the same character continues in more than 3 Shrink it to 2

ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of

bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on

bull Acronym (like WWW СССР) is not useful for language detection

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 53

Laugh Normalization

bull There are various laughs on each language

ndash HOW MUCH DO YOU LOVE COACH BEISTE

HHAHAHAHAHAH

ndash Hihihihi ) Habe ich regulaumlr 2x die Woche

ndash Tafil con eso Jajajajajajaja

ndash Malo Jejejeje XP

ndash kekeke chỗ đoacute lagravem aacuteo được ko em

bull Shrink them to double

ndash hahahha rArr haha

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 54

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 60

Usage (4) Detect Language

• ldig.py -m [model] [test data]

– Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data (fields are tab-separated)

– [correct label]\t[metadata]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL
ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
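A small reader for this format might look as follows. It is a sketch under one assumption drawn from the examples above: the first field is the label, the last field is the text, and anything in between (possibly nothing) is metadata. The names `Example` and `parse_line` are hypothetical.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    label: str           # correct language label, e.g. "en"
    metadata: List[str]  # zero or more metadata fields
    text: str            # the tweet text


def parse_line(line: str) -> Example:
    """Split one tab-separated corpus line into label, metadata and text."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return Example(fields[0], fields[1:-1], fields[-1])
```

For instance, `parse_line("en\tim gettin attacked for a tweet")` yields an example with an empty metadata list.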

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]

– Open http://localhost:[port] after it starts

– For a text entered in the text area, it shows the language probabilities, the features contained in the text, and their parameters

Estimation

language          size  detect  correct  precision (%)  recall (%)  LD53 (%)  LDsm (%)
ca Catalan        5093    4923     4857          98.66       95.37      95.3      97.0
cs Czech          7681    7668     7663          99.93       99.77      96.3      99.7
da Danish         5516    5472     5310          97.04       96.27      94.5      92.4
de German        10060   10069    10006          99.37       99.46      86.6      93.8
en English       10162   10133    10029          98.97       98.69      88.3      95.0
es Spanish       10244   10284    10120          98.41       98.79      91.5      96.0
fi Finnish        7051    7038     7024          99.80       99.62      98.9      99.6
fr French        10074   10134    10051          99.18       99.77      95.0      98.1
hu Hungarian      4904    4892     4858          99.30       99.06      85.8      95.5
id Indonesian    10178   10225    10160          99.36       99.82      89.7      98.9
it Italian       10143   10205    10103          99.00       99.61      96.2      98.0
nl Dutch         10005    9916     9858          99.42       98.53      69.5      97.4
no Norwegian      8504    8432     8201          97.26       96.44      96.0      96.3
pl Polish        10151   10149    10130          99.81       99.79      98.0      99.7
pt Portuguese    10212   10201    10119          99.20       99.09      88.0      96.9
ro Romanian       5913    5867     5850          99.71       98.93      92.8      97.4
sv Swedish       10025   10093     9942          98.50       99.17      96.0      97.9
tr Turkish       10308   10317    10298          99.82       99.90      97.6      99.5
vi Vietnamese    10487   10480    10474          99.94       99.88      98.7      99.2
total           166711           165053                      99.01      92.2      97.4

LD53 = langdetect + standard bundled profiles; LDsm = langdetect + profiles based on the Twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of detect is less than the sum of size
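The precision and recall columns follow directly from the three counts, and the 0.6 cutoff is a simple rule on the top probability. A minimal sketch, with hypothetical function names `precision_recall` and `decide`:

```python
def precision_recall(size: int, detect: int, correct: int):
    """Return (precision, recall) in percent.

    size:    texts whose true label is this language
    detect:  texts the detector assigned to this language
    correct: texts counted in both
    """
    return 100.0 * correct / detect, 100.0 * correct / size


def decide(probs: dict, threshold: float = 0.6):
    """Return the most probable label, or None when the text is
    treated as undetectable (maximum probability below the threshold)."""
    label = max(probs, key=probs.get)
    return label if probs[label] >= threshold else None


# Catalan row of the table: size 5093, detect 4923, correct 4857
p, r = precision_recall(5093, 4923, 4857)
print(round(p, 2), round(r, 2))
```

With the Catalan counts this reproduces the 98.66 and 95.37 shown in the table.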

Estimation for LIGA Dataset

• Evaluated on the LIGA [Tromp+ 11] dataset, which has 9066 tweets in 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm

• The 19-language model is used

language      size  detect  correct  precision (%)  recall (%)
de German     1479    1476     1469           99.5        99.3
en English    1505    1502     1490           99.2        99.0
es Spanish    1562    1548     1541           99.6        98.7
fr French     1551    1549     1540           99.4        99.3
it Italian    1539    1531     1528           99.8        99.3
nl Dutch      1430    1429     1424           99.7        99.6
total         9066             8992                       99.2

Estimation for Europarl Dataset

                                ldig      langdetect             CLD
language         size  correct  rate  correct   rate  correct   rate
bg Bulgarian     1000        -     -      988   98.8      991   99.1
cs Czech         1000     1000 100.0      994   99.4      995   99.5
da Danish        1000      976  97.6      968   96.8      932   93.2
de German        1000      999  99.9      998   99.8     1000  100.0
el Greek         1000        -     -     1000  100.0     1000  100.0
en English       1000      999  99.9      996   99.6     1000  100.0
es Spanish       1000     1000 100.0      996   99.6      989   98.9
et Estonian      1000        -     -      996   99.6      998   99.8
fi Finnish       1000      997  99.7      998   99.8     1000  100.0
fr French        1000      999  99.9      999   99.9      992   99.2
hu Hungarian     1000     1000 100.0      999   99.9      999   99.9
it Italian       1000      999  99.9      999   99.9      996   99.6
lt Lithuanian    1000        -     -      997   99.7      999   99.9
lv Latvian       1000        -     -      999   99.9      998   99.8
nl Dutch         1000     1000 100.0      974   97.4      995   99.5
pl Polish        1000      998  99.8      999   99.9      997   99.7
pt Portuguese    1000      995  99.5      996   99.6      989   98.9
ro Romanian      1000     1000 100.0      999   99.9      998   99.8
sk Slovak        1000      988  98.8 (langdetect)  990  99.0 (CLD)
sl Slovene       1000      976  97.6 (langdetect)  963  96.3 (CLD)
sv Swedish       1000      995  99.5      991   99.1      993   99.3
total           21000    13957  99.7    20850   99.3    20814   99.1

Rates are percentages. ldig is evaluated only on the languages it supports (14 of the 21, i.e. 14000 texts); "-" marks a language ldig does not support. For sk and sl, which ldig does not support either, the two value pairs belong to langdetect and CLD as labeled.

Conclusions

• A language detector using a maximal substring model

– Achieves over 99% accuracy for 19 languages

– Even langdetect reaches 97% accuracy when trained on the tweet corpus

• Cleaning up the corpus should raise the precision further

– It still contains many mistakes (in particular for da and no)

• Adding metadata to the features should also raise the precision

– Open question: how to add and train on metadata at low cost

• The model should be shrunk without loss of precision

– It is too large for applications (> 100 MB)

References

• [Nakatani (中谷) NLP12] Twitter Language Detection Using Maximal Substrings (極大部分文字列を使った twitter 言語判定, in Japanese)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Normalization for Substitute Characters

• ‘s/t with cedilla’ are substitute characters

– But they are more popular than the proper ones

– with cedilla : with comma = 2 : 1

– The “Rumanian IME” outputs the substitutes too :D

• Regard ‘s/t with comma’ as ‘s/t with cedilla’

ț (U+021B) ⇒ ţ (U+0163)

I reckon this is similar to the relationship between the two glyph variants of the Japanese character ‘SA’: さ さ
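As a sketch of the rule "regard s/t with comma as s/t with cedilla", a translation table over the four affected code points is enough. The function name `normalize_romanian` is hypothetical; the code points are the standard Unicode ones for S/T with comma below and S/T with cedilla.

```python
# S/T with comma below -> S/T with cedilla (upper and lower case)
_COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015E",  # Ș -> Ş
    "\u0219": "\u015F",  # ș -> ş
    "\u021A": "\u0162",  # Ț -> Ţ
    "\u021B": "\u0163",  # ț -> ţ
})


def normalize_romanian(text: str) -> str:
    """Fold the comma-below letters into their cedilla substitutes."""
    return text.translate(_COMMA_TO_CEDILLA)
```

After this step a word such as "cunoștință" has one spelling regardless of which variant the writer's keyboard produced.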

Arabic Character Normalization (on language-detection)

• Arabic and Persian have a similar problem

• The character ‘yeh’ in Farsi corresponds to 2 code points

– Wikipedia uses ی (U+06CC, Farsi yeh) only

– News uses ي (U+064A, Arabic yeh) only

• U+064A is a substitute in Farsi

– The popular Arabic charset CP-1256 has no character mapped to U+06CC

– As ‘yeh’ is very frequent in both languages, detection fails for almost all Persian text

• Regard U+06CC as U+064A
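The same table-based approach covers the yeh rule. This is a minimal sketch with a hypothetical function name, `normalize_yeh`; it only shows the single mapping stated above and is not the language-detection library's own code.

```python
# Farsi yeh (U+06CC) is folded into Arabic yeh (U+064A)
_FARSI_YEH_TO_ARABIC_YEH = str.maketrans({"\u06CC": "\u064A"})


def normalize_yeh(text: str) -> str:
    """Map every Farsi yeh to Arabic yeh so both sources look alike."""
    return text.translate(_FARSI_YEH_TO_ARABIC_YEH)
```

Applied to both training and test text, it removes the mismatch between Wikipedia-style and news-style Persian.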

Normalization for Vietnamese (1)

• Vietnamese has 12 vowels

– a ă â e ê i y o ô ơ u ư

• Vietnamese has 6 tones

– a ả à ã á ạ

– These tone marks are also used in ordinary documents such as news

• The tone marks can be attached to every vowel

– 12 × 6 = 72

Normalization for Vietnamese (2)

• Two representations of vowels with tones

1. Use the precomposed characters U+1EA0 - U+1EF9

• ẵ = U+1EB5

2. Combine a vowel with combining diacritical marks

• ẵ = U+0103 U+0303

– News and tweets use the two about half and half

• Normalize 2 into 1
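Unicode NFC normalization is one standard way to turn representation 2 into representation 1; whether ldig uses NFC or its own table is not stated here, so treat this as an assumption. The sketch also checks the 12 × 6 = 72 count from the previous slide. The names `normalize_vietnamese`, `VOWELS` and `TONE_MARKS` are illustrative.

```python
import unicodedata


def normalize_vietnamese(text: str) -> str:
    """Compose base vowel + combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


# The 12 vowels: a ă â e ê i y o ô ơ u ư
VOWELS = "a\u0103\u00e2e\u00eaiyo\u00f4\u01a1u\u01b0"
# No mark, hook above, grave, tilde, acute, dot below
TONE_MARKS = ["", "\u0309", "\u0300", "\u0303", "\u0301", "\u0323"]

composed = {normalize_vietnamese(v + t) for v in VOWELS for t in TONE_MARKS}
print(len(composed))
```

Every vowel and tone combination ends up as a single code point, so a substring feature sees the same character whichever input form was used.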

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)

– Other scripts have only 30-50 characters

• The character space is very sparse

– Characters that don’t occur in the training corpus have no probabilities

• e.g. 谢谢, Kanji for personal names

– Common frequent characters are too strong

• e.g. a text that contains ”的” tends to be detected as Traditional Chinese

• Since Japanese also uses Kana, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character

– (1) K-means clustering

• Use tf-idf on Wikipedia and Google News

• K=50 (size of the ASCII alphabet = 52)

– (2) “Commonly Used Kanji” lists for Japanese and Chinese

• Simplified Chinese: 现代汉语常用字表 (3500)

• Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 Level 1 (5401)

• Japanese: 常用漢字 (2136) ∪ JIS Level 1 (2965) = 2998

– 常用漢字 contains few Kanji for personal names and place names

• Generate 130 clusters from the product of (1) and (2)

Normalization for Twitter

• Simply remove

– URLs

– mentions

– hashtags

– RT

– emoticons written with letters, like XD or :p
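A rough sketch of this removal step with regular expressions is shown below. The patterns are illustrative guesses at what counts as a URL, mention, hashtag or letter emoticon; they are not ldig's own patterns, and the name `strip_twitter_noise` is hypothetical.

```python
import re

_PATTERNS = [
    re.compile(r"https?://\S+"),   # URLs
    re.compile(r"@\w+"),           # mentions
    re.compile(r"#\w+"),           # hashtags
    re.compile(r"\bRT\b"),         # retweet marker
    # letter emoticons such as XD, xDDD, :p, :D (assumed shapes)
    re.compile(r"(?<!\w)(?:[xX]+D+|[:;=]-?[pPdD]+)(?!\w)"),
]
_WHITESPACE = re.compile(r"\s+")


def strip_twitter_noise(text: str) -> str:
    """Remove Twitter-specific tokens, then collapse the leftover spaces."""
    for pattern in _PATTERNS:
        text = pattern.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


print(strip_twitter_noise("RT @foo check this http://t.co/abc #fun xDDD hola :p"))
```

URLs are removed first so that a fragment such as `#section` inside a link is not mistaken for a hashtag.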

Normalization for Twitter-Specific Representation

• How to handle expressions like ‘coooooooollllll’?

• Case 1: Build a normalization dictionary using [Brody+ 11]

– Unsupervised normalization like coooollll → cool

– It can’t handle words that are not in the dictionary

• Case 2: If the same character is repeated 3 or more times, shrink the run to 2

– No language has 3 or more consecutive occurrences of the same Latin letter in its orthography

• In Japanese, by contrast, there are “かたたたき”, “かわいいいぬ”, “あわててて” and so on

• Acronyms (like WWW, СССР) are not useful for language detection
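Case 2 fits in a single regular expression. The sketch below assumes "3 or more" as the trigger and restricts the rule to Latin letters (basic, Latin-1, Latin Extended and Latin Extended Additional ranges), so that Japanese words with legitimate triple characters are left alone; both choices are assumptions, and `shrink_repeats` is a hypothetical name.

```python
import re

# A Latin letter followed by at least two more copies of itself
_LATIN_RUN = re.compile(r"([A-Za-z\u00C0-\u024F\u1E00-\u1EFF])\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink any run of 3+ identical Latin letters to exactly 2."""
    return _LATIN_RUN.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))
```

Note that the result is "cooll", not "cool": unlike the dictionary approach of Case 1, this rule does not recover the original word, it only maps all lengthened variants to one form.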

Laugh Normalization

• Each language has its own ways of writing laughter

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi ) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha
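One possible way to shrink laughs is to match a consonant and vowel pair that repeats, tolerating doubled letters such as the "hh" in "hahahha". The consonant set (h, j, k), the lowercasing of the result and the name `shrink_laugh` are all assumptions made for this sketch; ldig's actual rule may differ.

```python
import re

# consonant (h/j/k) + vowel, repeated at least twice, letters may be doubled
_LAUGH = re.compile(
    r"\b([hjk])\1*([aeiou])\2*(?:\1+\2+)+\1*\b",
    re.IGNORECASE,
)


def shrink_laugh(text: str) -> str:
    """Replace a laugh such as 'hahahha' or 'Jajajaja' with two syllables."""
    return _LAUGH.sub(lambda m: (m.group(1) + m.group(2)).lower() * 2, text)


for sample in ["hahahha", "Tafil con eso Jajajajajajaja", "kekeke"]:
    print(shrink_laugh(sample))
```

Keeping the syllable itself (ha, ja, je, ke, hi) matters, because the choice of laugh is a hint about the language.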

Page 46: Short Text Language Detection with Infinity-Gram

Normalization for Vietnamese (1)

bull Vietnamese has 12 vowels

ndash a ă acirc e ecirc i y o ocirc ơ u ư

bull Vietnamese has 6 tones

ndash a ả agrave atilde aacute ạ

ndash These tone symbols are used also in general documents like news

bull The tone symbols can be appended to all vowels

ndash 12 6 = 72

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 48

Normalization for Vietnamese (2)

bull Representation of vowels with

tones

1 Use U+1ea0 - U+1ef9

bull ẵ = U+1eb5

2 Combine with Diacritical Marks

bull ẵ = U+0103 U+0303

ndash Half and half on news and tweet

bull Normalize 2 into 1 Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 49

CJK-Kanji Normalization (1) (on language-detection)

bull CJK-Kanji has too many characters(more than 20K)

ndash Other character types have only 30-50 characters

bull The character space is very sparse

ndash Characters that donrsquot occur in the training corpus have no probabilities

bull eg 谢谢 Kanji for person name

ndash Common frequent characters are too strong

bull eg a text which has rdquo的rdquo tends to be detected as Traditional Chinese

bull Hence Kana is used in Japanese too the probabilities of Kanji in Japanese are less than ones in Chinese

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 50

CJK-Kanji Normalization (2) (on language-detection)

bull Group Kanjis by frequency and normalize each group to the representative character

ndash (1) K-means clustering

bull Use tf-idf on Wikipedia and Google News

bull K=50 (size of ascii alphabet = 52)

ndash (2) ldquoCommonly Used Kanjirdquo provided in Japanese and Chinese

bull Simplified Chinese 现代汉语常用字表(3500)

bull Traditional Chinese 常用国字標準字体表(4808) sub Big5 the first standard(5401)

bull Japanese 常用漢字(2136)cup JIS the first standard(2965) = 2998

ndash 常用漢字 doesnrsquot have Kanji for person name and place name very much

bull Generate 130 clusters from product of (1) and (2)

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 51

Normalization for twitter

bull Remove simply

ndash URL

ndash mention

ndash hash tag

ndash RT

ndash face mark using alphabet like XD p

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 52

Normalization for twitter-Specific Representation

bull How to Like lsquocoooooooollllllrsquo

bull Case 1 Make a normalization dictionary using [Brody+ 2011]

ndash Unsupervised normalization like coooollll rarr cool

ndash It canrsquot handle words that are not in the dictionary

bull Case 2 If the same character continues in more than 3 Shrink it to 2

ndash There is no language which over 3 continuation of the same Latin alphabet in orthography of

bull If in Japanese there are ldquoかたたたきrdquo ldquoかわいいいぬrdquo ldquoあわてててrdquo and so on

bull Acronym (like WWW СССР) is not useful for language detection

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 53

Laugh Normalization

bull There are various laughs on each language

ndash HOW MUCH DO YOU LOVE COACH BEISTE

HHAHAHAHAHAH

ndash Hihihihi ) Habe ich regulaumlr 2x die Woche

ndash Tafil con eso Jajajajajajaja

ndash Malo Jejejeje XP

ndash kekeke chỗ đoacute lagravem aacuteo được ko em

bull Shrink them to double

ndash hahahha rArr haha

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 54

Implementation and Estimation

Short Text Language Detection with

Infinity-Gram (NAIST Seminar) 55

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 60

Usage (4) Detect Language

• ldig.py -m [model] [test data]
  – Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data
  – [correct label]\t[metadata]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
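A record in this format could be read as in the sketch below. It assumes that the label is the first tab-separated field, the text is the last one, and everything in between is metadata (which may be empty or consist of several fields, as in the Catalan example). The function name `parse_line` is hypothetical, and the metadata values used in the usage example are made up.

```python
def parse_line(line):
    """Split one '[correct label]\\t[metadata]\\t[text]' record into its fields.

    Returns (label, metadata_fields, text). The first field is taken as the
    label and the last as the text; this is an assumption for illustration.
    """
    fields = line.rstrip("\n").split("\t")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text
```

For example, `parse_line("ca\t123\t2012-05-14\tuser1\tes\txDD you know")` yields the label `"ca"`, four metadata fields, and the text `"xDD you know"`.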

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]
  – After starting it, open http://localhost:[port]
  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters

Estimation

language          size  detect  correct  precision  recall  LD53  LDsm
ca Catalan        5093    4923     4857      98.66   95.37  95.3  97.0
cs Czech          7681    7668     7663      99.93   99.77  96.3  99.7
da Danish         5516    5472     5310      97.04   96.27  94.5  92.4
de German        10060   10069    10006      99.37   99.46  86.6  93.8
en English       10162   10133    10029      98.97   98.69  88.3  95.0
es Spanish       10244   10284    10120      98.41   98.79  91.5  96.0
fi Finnish        7051    7038     7024      99.80   99.62  98.9  99.6
fr French        10074   10134    10051      99.18   99.77  95.0  98.1
hu Hungarian      4904    4892     4858      99.30   99.06  85.8  95.5
id Indonesian    10178   10225    10160      99.36   99.82  89.7  98.9
it Italian       10143   10205    10103      99.00   99.61  96.2  98.0
nl Dutch         10005    9916     9858      99.42   98.53  69.5  97.4
no Norwegian      8504    8432     8201      97.26   96.44  96.0  96.3
pl Polish        10151   10149    10130      99.81   99.79  98.0  99.7
pt Portuguese    10212   10201    10119      99.20   99.09  88.0  96.9
ro Romanian       5913    5867     5850      99.71   98.93  92.8  97.4
sv Swedish       10025   10093     9942      98.50   99.17  96.0  97.9
tr Turkish       10308   10317    10298      99.82   99.90  97.6  99.5
vi Vietnamese    10487   10480    10474      99.94   99.88  98.7  99.2
total           166711       -   165053          -   99.01  92.2  97.4

All rates are percentages.

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the twitter corpus

A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of detect is less than the sum of size.
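The precision and recall columns follow from the counts: precision is correct / detect and recall is correct / size. The sketch below recomputes the Catalan row and illustrates the 0.6 threshold rule. The function names are hypothetical, and treating a probability of exactly 0.6 as detectable is an assumption.

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent, rounded to two decimals."""
    precision = round(100.0 * correct / detect, 2)
    recall = round(100.0 * correct / size, 2)
    return precision, recall

def decide(probabilities, threshold=0.6):
    """Return the most probable language, or None if it is below the threshold."""
    lang = max(probabilities, key=probabilities.get)
    return lang if probabilities[lang] >= threshold else None

# Catalan row: size 5093, detect 4923, correct 4857
print(precision_recall(5093, 4923, 4857))  # (98.66, 95.37)
```

Under this rule a close call such as `{"da": 0.55, "no": 0.45}` is reported as undetectable, which costs recall but protects precision.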

Estimation for LIGA dataset

• Evaluated on the LIGA [Tromp+ 11] dataset, which has 9066 tweets in 6 languages
  – http://www.win.tue.nl/~mpechen/projects/smm/
• The 19-language model is used

language         size  detect  correct  precision  recall
de German        1479    1476     1469       99.5    99.3
en English       1505    1502     1490       99.2    99.0
es Spanish       1562    1548     1541       99.6    98.7
fr French        1551    1549     1540       99.4    99.3
it Italian       1539    1531     1528       99.8    99.3
nl Dutch         1430    1429     1424       99.7    99.6
total            9066       -     8992          -    99.2

Estimation for Europarl Dataset

For ldig, only the languages it supports are evaluated. Rates are percentages.

                         ldig        langdetect          CLD
language         size  correct  rate  correct  rate  correct  rate
bg Bulgarian     1000        -     -      988  98.8      991  99.1
cs Czech         1000     1000 100.0      994  99.4      995  99.5
da Danish        1000      976  97.6      968  96.8      932  93.2
de German        1000      999  99.9      998  99.8     1000 100.0
el Greek         1000        -     -     1000 100.0     1000 100.0
en English       1000      999  99.9      996  99.6     1000 100.0
es Spanish       1000     1000 100.0      996  99.6      989  98.9
et Estonian      1000        -     -      996  99.6      998  99.8
fi Finnish       1000      997  99.7      998  99.8     1000 100.0
fr French        1000      999  99.9      999  99.9      992  99.2
hu Hungarian     1000     1000 100.0      999  99.9      999  99.9
it Italian       1000      999  99.9      999  99.9      996  99.6
lt Lithuanian    1000        -     -      997  99.7      999  99.9
lv Latvian       1000        -     -      999  99.9      998  99.8
nl Dutch         1000     1000 100.0      974  97.4      995  99.5
pl Polish        1000      998  99.8      999  99.9      997  99.7
pt Portuguese    1000      995  99.5      996  99.6      989  98.9
ro Romanian      1000     1000 100.0      999  99.9      998  99.8
sk Slovak        1000        -     -      988  98.8      990  99.0
sl Slovene       1000        -     -      976  97.6      963  96.3
sv Swedish       1000      995  99.5      991  99.1      993  99.3
total           21000    13957  99.7    20850  99.3    20814  99.1

Conclusions

• A language detector using a maximal substring model
  – Achieves over 99% accuracy for 19 languages
  – Even langdetect reaches 97% accuracy with a tweet corpus
• If the corpus is maintained, the precision will improve further
  – There are still many mistakes (in particular for da and no)
• If metadata is added to the features, the precision will improve further
  – Open question: how to add and train on metadata at low cost
• The model should be shrunk without loss of precision
  – It is too large for applications (> 100 MB)

References

• [中谷 NLP12] 極大部分文字列を使った twitter 言語判定 (Twitter language detection using maximal substrings)
• [Okanohara+ 09] Text Categorization with All Substring Features
• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs
• [Cavnar+ 94] N-Gram-Based Text Categorization
• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty


Normalization for Vietnamese (2)

• Vowels with tone marks have two representations
  1. Precomposed characters in U+1EA0 - U+1EF9
     • ẵ = U+1EB5
  2. A base letter combined with combining diacritical marks
     • ẵ = U+0103 U+0303
  – The two forms appear about half and half in news and tweets
• Normalize form 2 into form 1
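One readily available way to turn form 2 into form 1 is Unicode NFC normalization, which composes a base letter and its combining marks into the precomposed character when one exists. The sketch below uses Python's standard `unicodedata` module. NFC is a stand-in chosen for illustration: the slides do not say whether ldig uses NFC or its own mapping table, and the function name is hypothetical.

```python
import unicodedata

def normalize_vietnamese(text):
    """Compose base letters and combining marks into precomposed characters (NFC)."""
    return unicodedata.normalize("NFC", text)

# U+0103 (a with breve) + U+0303 (combining tilde) composes to U+1EB5
composed = normalize_vietnamese("\u0103\u0303")
print(len(composed), hex(ord(composed)))  # 1 0x1eb5
```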

CJK-Kanji Normalization (1) (on language-detection)

• CJK Kanji has too many characters (more than 20K)
  – Other character types have only 30-50 characters
• The character space is very sparse
  – Characters that don't occur in the training corpus have no probabilities
    • e.g. 谢谢, Kanji for person names
  – Common frequent characters are too strong
    • e.g. a text that contains "的" tends to be detected as Traditional Chinese
    • Since Kana is also used in Japanese, the probabilities of Kanji in Japanese are lower than those in Chinese

CJK-Kanji Normalization (2) (on language-detection)

• Group Kanji by frequency and normalize each group to a representative character
  – (1) K-means clustering
    • Uses tf-idf on Wikipedia and Google News
    • K=50 (size of ASCII alphabet = 52)
  – (2) "Commonly Used Kanji" lists provided in Japanese and Chinese
    • Simplified Chinese: 现代汉语常用字表 (3500)
    • Traditional Chinese: 常用国字標準字体表 (4808) ⊂ Big5 first level (5401)
    • Japanese: 常用漢字 (2136) ∪ JIS first level (2965) = 2998
      – 常用漢字 contains few Kanji for person names and place names
• Generate 130 clusters from the product of (1) and (2)

Normalization for twitter

• Simply remove:
  – URLs
  – mentions
  – hashtags
  – RT
  – face marks written with letters, like XD or :p
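The removals listed above can be approximated with a few regular expressions. This is a rough sketch: the patterns and the function name `strip_twitter_noise` are assumptions for illustration and are not the expressions used in ldig. In particular, the face-mark pattern covers only a handful of shapes such as XD, xDDD and :p.

```python
import re

# Hypothetical patterns, applied in this order.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
RT = re.compile(r"\bRT\b")
FACE = re.compile(r"(?<!\w)(?:[xX]D+|[:;]-?[pPdD])(?!\w)")

def strip_twitter_noise(text):
    """Remove URLs, mentions, hashtags, RT and simple face marks, then tidy spaces."""
    for pattern in (URL, MENTION, HASHTAG, RT, FACE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

print(strip_twitter_noise("RT @shuyo check http://example.com #ldig XD ok"))  # check ok
```

URLs are removed first so that a "#" or "@" inside a URL is not treated as a hashtag or mention.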

Normalization for twitter-Specific Representation

• How should lengthened words like 'coooooooollllll' be handled?
• Case 1: Build a normalization dictionary using [Brody+ 2011]
  – Unsupervised normalization like coooollll → cool
  – It can't handle words that are not in the dictionary
• Case 2: If the same character is repeated 3 or more times, shrink the run to 2
  – No language has 3 or more consecutive identical Latin letters in its orthography
    • Japanese, in contrast, has "かたたたき", "かわいいいぬ", "あわてててて" and so on
    • Acronyms (like WWW, СССР) are not useful for language detection
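Case 2 fits in a single regular expression with a backreference. The sketch below shrinks any run of three or more identical characters to two. The function name is hypothetical, and the pattern is not restricted to Latin letters, so it suits Latin-alphabet text only, as the Japanese examples show.

```python
import re

# A character followed by two or more copies of itself.
REPEAT = re.compile(r"(.)\1{2,}")

def shrink_repeats(text):
    """Shrink every run of 3 or more identical characters to exactly 2."""
    return REPEAT.sub(r"\1\1", text)

print(shrink_repeats("coooooooollllll"))  # cooll
```

The result "cooll" is not the dictionary word "cool", but every lengthened variant maps to the same string, which is enough for feature extraction.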

Laugh Normalization

• Each language has its own ways of writing laughter
  – HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH
  – Hihihihi ) Habe ich regulär 2x die Woche
  – Tafil con eso Jajajajajajaja
  – Malo Jejejeje XP
  – kekeke chỗ đó làm áo được ko em
• Shrink them to two repetitions
  – hahahha ⇒ haha
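Regular laughs of the kind listed above can be shortened with a backreference over a consonant-vowel syllable. This is a simplified sketch with a hypothetical pattern and function name: it handles only exact repetitions such as "jajaja" or "kekeke", so an irregular spelling like "hahahha" would need extra rules that are not shown here.

```python
import re

# A syllable such as "ha", "ji" or "ke" followed by two or more copies of itself.
LAUGH = re.compile(r"([hjk][aeiou])\1{2,}", re.IGNORECASE)

def shrink_laugh(text):
    """Shrink a laugh of 3 or more repeated syllables to its first two syllables."""
    return LAUGH.sub(lambda m: m.group(0)[:4], text)

print(shrink_laugh("Tafil con eso Jajajajajajaja"))  # Tafil con eso Jaja
```

Keeping the first four characters of the match preserves the original capitalization, so "Jejejeje" becomes "Jeje".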

Implementation and Estimation

Normalization for Twitter

• Simply remove:

– URLs

– mentions

– hashtags

– RT

– emoticons written with letters, such as XD or :p
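The removal step listed above can be sketched with a few regular expressions. This is only a minimal illustration: the slide names the categories to remove but not the patterns, so every regex and the function name `strip_tweet_noise` are assumptions, not ldig's actual code.

```python
import re

# Hypothetical patterns: the slide names the categories, not the exact regexes.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\bRT\b")
# Emoticons written with letters, e.g. XD, xDDD, XP, :p, :D (assumed shapes).
FACE_RE = re.compile(r"(?<!\w)(?:[xX][dDpP]+|[:;=][dDpP]+)(?!\w)")


def strip_tweet_noise(text: str) -> str:
    """Remove URLs, mentions, hashtags, RT markers and letter emoticons."""
    # URLs go first so that '@' or '#' inside a URL is not handled separately.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, RT_RE, FACE_RE):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(strip_tweet_noise("RT @user check http://t.co/abc #fun XD cool"))  # check cool
```

Replacing each match with a space rather than an empty string keeps neighbouring words from being glued together, which would otherwise create substrings that never occurred in the tweet.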

Normalization for Twitter-Specific Representation

• How should spellings like ‘coooooooollllll’ be handled?

• Case 1: Build a normalization dictionary using [Brody+ 11]

– Unsupervised normalization such as coooollll → cool

– It can’t handle words that are not in the dictionary

• Case 2: If the same character is repeated 3 or more times in a row, shrink the run to 2

– No language has 3 or more consecutive occurrences of the same Latin letter in its orthography

• Japanese, by contrast, does have such runs: “かたたたき”, “かわいいいぬ”, “あわててて” and so on

• Acronyms (such as WWW or СССР) are exceptions, but they are not useful for language detection
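Case 2 can be sketched as a single substitution. The character range used for "Latin letter" (ASCII plus the Latin-1 Supplement and Latin Extended-A/B blocks) and the function name are assumptions made for illustration; the slide does not say how the implementation delimits the Latin alphabet.

```python
import re

# A Latin letter followed by two or more copies of itself.
# The range is an assumed approximation of "Latin alphabet".
LATIN_RUN_RE = re.compile(r"([A-Za-z\u00C0-\u024F])\1{2,}")


def shrink_repeats(text: str) -> str:
    """Shrink any run of 3+ identical Latin letters down to 2."""
    return LATIN_RUN_RE.sub(r"\1\1", text)


print(shrink_repeats("coooooooollllll"))  # cooll
print(shrink_repeats("かたたたき"))        # unchanged: not Latin letters
```

Unlike the dictionary approach of Case 1, this does not recover the canonical spelling (`cooll` rather than `cool`); it only bounds the number of distinct substrings that lengthening can produce.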

Laugh Normalization

• Each language has its own ways of writing laughter

– HOW MUCH DO YOU LOVE COACH BEISTE HHAHAHAHAHAH

– Hihihihi ) Habe ich regulär 2x die Woche

– Tafil con eso Jajajajajajaja

– Malo Jejejeje XP

– kekeke chỗ đó làm áo được ko em

• Shrink them to two repetitions

– hahahha ⇒ haha
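One way to sketch this shrinking is a regex that matches a consonant-vowel syllable repeated at least once more and rewrites it as exactly two syllables. The set of laugh consonants (`h`, `j`, `k`) and vowels (`a`, `e`, `i`, `o`) is an assumption taken from the examples above, and the pattern tolerates doubled letters so that the slide's own `hahahha` example works.

```python
import re

# Assumed laugh shape: one of h/j/k plus a vowel, then the same pair repeated.
# '+' after each part tolerates typos such as the doubled 'h' in "hahahha".
LAUGH_RE = re.compile(r"([hjk])+([aeio])+(?:\1+\2+)+", re.IGNORECASE)


def normalize_laugh(text: str) -> str:
    """Shrink repeated laugh syllables to exactly two repetitions."""
    return LAUGH_RE.sub(r"\1\2\1\2", text)


print(normalize_laugh("hahahha"))                        # haha
print(normalize_laugh("kekeke chỗ đó làm áo được ko em"))  # keke chỗ đó làm áo được ko em
```

Because the replacement reuses the captured characters, a capitalised laugh such as `Jajajaja` becomes `JaJa`; lowercasing the text beforehand avoids that if it matters.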

Implementation and Estimation

Language Detection with Infinity-Gram (ldig)

• Tweet language detection for languages written in the Latin alphabet

– https://github.com/shuyo/ldig

• MIT license

• The trained model is also distributed there

– ∞-gram LR (logistic regression over maximal substrings) [Okanohara+ 09]

– L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

– Double Array

Usage (1): Model Initialization

• ldig.py -m [model] --init [corpus] -x [maximal substring extractor] --ff=[lower limit of frequency]

– Extracts features from the corpus and initializes the model

– -m: model directory

– -x: path to the maximal substring extractor (executed as an external process)

– --ff: ignore features whose frequency is below the specified value

Maximal Substring Extractor

• maxsubst [input file] [output file]

– Input is multi-line text

• TABs are replaced with “ ” and line feeds with U+0001

– Output format: “[feature]\t[frequency]”
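The input and output conventions above can be sketched as two small helpers. Both function names are hypothetical, and whether the replacement of TABs and line feeds happens inside the extractor or in the caller is not stated on the slide; the sketch assumes the caller prepares the text. The `min_freq` argument mirrors the `--ff` lower frequency limit.

```python
def to_extractor_input(lines):
    """Join corpus lines into one string: TAB -> space, line feed -> U+0001."""
    cleaned = (line.rstrip("\r\n").replace("\t", " ") for line in lines)
    return "\u0001".join(cleaned)


def parse_extractor_output(output_text, min_freq=1):
    """Parse '[feature]<TAB>[frequency]' rows, dropping rare features."""
    features = {}
    for row in output_text.split("\n"):
        # The feature itself never contains a TAB, so split on the last one.
        feature, sep, freq = row.rpartition("\t")
        if not sep:
            continue  # blank or malformed row
        if int(freq) >= min_freq:
            features[feature] = int(freq)
    return features


print(repr(to_extractor_input(["a\tb\n", "c"])))  # 'a b\x01c'
```

Using U+0001 as the line separator keeps line boundaries visible to the substring extractor without colliding with any character that normally appears in tweets.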

Usage (2): Learning

• ldig.py -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

– Trains the model on the corpus for one cycle of SGD

– -e: learning rate of SGD

– -r: coefficient of L1 regularization

– --wr: how many times to regularize all of the parameters

• There are too many parameters to regularize all of them at every step
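The idea of regularizing only the active parameters on each step, with an occasional pass over all of them, can be sketched with the cumulative-penalty update of [Tsuruoka+ 09]. This is a simplified illustration, not ldig's code: it uses binary rather than multi-class logistic regression, binary features given as index lists, and a hypothetical `whole_reg_every` interval standing in for whatever `--wr` controls.

```python
import math


def apply_penalty(w, q, u, i):
    """Clip weight i toward zero by the L1 penalty it has not yet received."""
    z = w[i]
    if w[i] > 0:
        w[i] = max(0.0, w[i] - (u + q[i]))
    elif w[i] < 0:
        w[i] = min(0.0, w[i] + (u - q[i]))
    q[i] += w[i] - z  # total penalty actually applied to weight i so far


def train_one_cycle(data, n_features, eta=0.1, reg=0.01, whole_reg_every=1000):
    """One SGD pass over (active_feature_indices, label) pairs, label in {0, 1}."""
    w = [0.0] * n_features
    q = [0.0] * n_features
    u = 0.0  # total penalty each weight could have received so far
    for step, (features, label) in enumerate(data, start=1):
        p = 1.0 / (1.0 + math.exp(-sum(w[i] for i in features)))
        u += eta * reg
        for i in features:  # a normal step touches only the active features
            w[i] += eta * (label - p)
            apply_penalty(w, q, u, i)
        if step % whole_reg_every == 0:  # occasional pass over every parameter
            for i in range(n_features):
                apply_penalty(w, q, u, i)
    return w
```

Because `u` accumulates across steps and `q[i]` records what weight `i` has already paid, a feature that is inactive for many steps receives its overdue penalty the next time it is touched, which is what makes the lazy update cheap for very large feature sets.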

Usage (3): Shrink Model

• ldig.py -m [model] --shrink

– Removes ineffective features (those whose parameters are all 0) from the model
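The shrinking step amounts to dropping every feature whose parameter row is entirely zero, which L1 regularization makes common. A minimal sketch, assuming the model is held as a list of feature strings and a parallel list of per-language parameter rows (the real model uses a double array, so this layout and the function name are illustrative only):

```python
def shrink_model(features, params):
    """Keep only the features that have at least one non-zero parameter."""
    kept = [(f, row) for f, row in zip(features, params)
            if any(v != 0.0 for v in row)]
    kept_features = [f for f, _ in kept]
    kept_params = [row for _, row in kept]
    return kept_features, kept_params


features = ["ik", "ich", "the"]
params = [[0.0, 1.2], [0.0, 0.0], [-0.3, 0.0]]
print(shrink_model(features, params))  # (['ik', 'the'], [[0.0, 1.2], [-0.3, 0.0]])
```

A feature with all-zero parameters contributes nothing to any language's score, so removing it changes the model size but not its predictions.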

Usage (4): Detect Language

• ldig.py -m [model] [test data]

– Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data

– [correct label]\t[metadata]\t[text]

en u should just enjoy ur vacation sadly

en D im online but you arent RT that much

en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
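A reader for this format can be sketched as follows. It assumes the fields are TAB-separated, that the first field is the label and the last is the text, and that any fields in between are metadata; the function name and the return shape are hypothetical.

```python
def parse_line(line):
    """Split one TAB-separated record into (label, metadata fields, text)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    label, text = fields[0], fields[-1]
    metadata = fields[1:-1]  # empty when the record has no metadata
    return label, metadata, text


print(parse_line("en\tu should just enjoy ur vacation sadly"))
# ('en', [], 'u should just enjoy ur vacation sadly')
```

Taking the text from the last field lets the same reader handle both the short records without metadata and the longer ones that carry status ID, datetime, user ID and UI language.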

Usage (5): Estimation Tool

• server.py -m [model] -p [port number]

– After starting it, open http://localhost:[port]

– For text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters

Estimation

LD53 = langdetect + standard bundled profiles

LDsm = langdetect + profiles based on the Twitter corpus

Because a text whose maximum probability is < 0.6 is treated as undetectable, the sum of “detect” is less than the sum of “size”. Precision, recall, LD53 and LDsm are percentages.

language         size    detect  correct  precision  recall  LD53  LDsm
ca Catalan       5093    4923    4857     98.66      95.37   95.3  97.0
cs Czech         7681    7668    7663     99.93      99.77   96.3  99.7
da Danish        5516    5472    5310     97.04      96.27   94.5  92.4
de German        10060   10069   10006    99.37      99.46   86.6  93.8
en English       10162   10133   10029    98.97      98.69   88.3  95.0
es Spanish       10244   10284   10120    98.41      98.79   91.5  96.0
fi Finnish       7051    7038    7024     99.80      99.62   98.9  99.6
fr French        10074   10134   10051    99.18      99.77   95.0  98.1
hu Hungarian     4904    4892    4858     99.30      99.06   85.8  95.5
id Indonesian    10178   10225   10160    99.36      99.82   89.7  98.9
it Italian       10143   10205   10103    99.00      99.61   96.2  98.0
nl Dutch         10005   9916    9858     99.42      98.53   69.5  97.4
no Norwegian     8504    8432    8201     97.26      96.44   96.0  96.3
pl Polish        10151   10149   10130    99.81      99.79   98.0  99.7
pt Portuguese    10212   10201   10119    99.20      99.09   88.0  96.9
ro Romanian      5913    5867    5850     99.71      98.93   92.8  97.4
sv Swedish       10025   10093   9942     98.50      99.17   96.0  97.9
tr Turkish       10308   10317   10298    99.82      99.90   97.6  99.5
vi Vietnamese    10487   10480   10474    99.94      99.88   98.7  99.2
total            166711          165053              99.01   92.2  97.4
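The columns of the table above are related by precision = correct / detect and recall = correct / size, with low-confidence predictions excluded from "detect". A minimal sketch of that bookkeeping, assuming each prediction is a (label, maximum probability) pair; the function name and the return shape are illustrative:

```python
from collections import Counter


def evaluate(gold_labels, predictions, threshold=0.6):
    """Per-language (size, detect, correct, precision %, recall %)."""
    size, detect, correct = Counter(gold_labels), Counter(), Counter()
    for gold, (label, prob) in zip(gold_labels, predictions):
        if prob < threshold:
            continue  # treated as undetectable: counts toward size only
        detect[label] += 1
        if label == gold:
            correct[label] += 1
    report = {}
    for lang in size:
        precision = 100.0 * correct[lang] / detect[lang] if detect[lang] else 0.0
        recall = 100.0 * correct[lang] / size[lang]
        report[lang] = (size[lang], detect[lang], correct[lang], precision, recall)
    return report


# The Catalan row reproduces from its counts:
print(round(100 * 4857 / 4923, 2), round(100 * 4857 / 5093, 2))  # 98.66 95.37
```

This also explains why "detect" can exceed "size" for a single language (German, for example): texts of other languages that were wrongly labelled as that language count toward its "detect" total.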

Estimation for LIGA dataset

• Estimated using the LIGA [Tromp+ 11] dataset, with 9066 tweets for 6 languages

– http://www.win.tue.nl/~mpechen/projects/smm/

Uses the 19-language model. Precision and recall are percentages.

language      size   detect  correct  precision  recall
de German     1479   1476    1469     99.5       99.3
en English    1505   1502    1490     99.2       99.0
es Spanish    1562   1548    1541     99.6       98.7
fr French     1551   1549    1540     99.4       99.3
it Italian    1539   1531    1528     99.8       99.3
nl Dutch      1430   1429    1424     99.7       99.6
total         9066           8992                99.2

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports ("-" marks a language it does not support). Rates are percentages.

                          ldig            langdetect      CLD
language         size     correct  rate   correct  rate   correct  rate
bg Bulgarian     1000     -        -      988      98.8   991      99.1
cs Czech         1000     1000     100.0  994      99.4   995      99.5
da Danish        1000     976      97.6   968      96.8   932      93.2
de German        1000     999      99.9   998      99.8   1000     100.0
el Greek         1000     -        -      1000     100.0  1000     100.0
en English       1000     999      99.9   996      99.6   1000     100.0
es Spanish       1000     1000     100.0  996      99.6   989      98.9
et Estonian      1000     -        -      996      99.6   998      99.8
fi Finnish       1000     997      99.7   998      99.8   1000     100.0
fr French        1000     999      99.9   999      99.9   992      99.2
hu Hungarian     1000     1000     100.0  999      99.9   999      99.9
it Italian       1000     999      99.9   999      99.9   996      99.6
lt Lithuanian    1000     -        -      997      99.7   999      99.9
lv Latvian       1000     -        -      999      99.9   998      99.8
nl Dutch         1000     1000     100.0  974      97.4   995      99.5
pl Polish        1000     998      99.8   999      99.9   997      99.7
pt Portuguese    1000     995      99.5   996      99.6   989      98.9
ro Romanian      1000     1000     100.0  999      99.9   998      99.8
sk Slovak        1000     -        -      988      98.8   990      99.0
sl Slovene       1000     -        -      976      97.6   963      96.3
sv Swedish       1000     995      99.5   991      99.1   993      99.3
total            21000    13957    99.7   20850    99.3   20814    99.1

Conclusions

• Language detector using a maximal substring model

– Detects 19 languages with over 99% accuracy

– Even langdetect reaches 97% accuracy when trained on the tweet corpus

• If the corpus is maintained further, precision should improve further

– There are still many mistakes (in particular for da and no)

• If metadata is added to the features, precision should improve further

– Open question: how to add and train on metadata at low cost

• The model should be shrunk without loss of precision

– It is currently too large for applications (> 100 MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 54: Short Text Language Detection with Infinity-Gram

Language Detection with Infinity-Gram (ldig)

bull tweet language detection for Latin

alphabet

ndash httpsgithubcomshuyoldig

bull MIT license

bull Distribute also the trained model here

ndash infin-gram LR(maximal substring) [Okanohara+ 09]

ndash L1 SGD (Cumulative Penalty) [Tsuruoka+ 09]

ndash Double Array

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 56

Usage (1) Model Initialization

bull ldigpy -m [model] --init [corpus] -x [maximal string extractor] --ff=[lower limit of frequency]

ndash Extract features from corpus and initialize model

ndash -m model directory

ndash -x path of maximal substring extractor (execute as external process)

ndash --ff Ignore less than the specified value

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 57

Maximal String Extractor

bull maxsubst [input file] [output file]

ndash Input as multiple line text

bull Replace TABs to ldquo ldquo line feeds to U+0001 in it

ndash Output as rdquo[features]yent[frequency]rdquo

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 58

Usage (2) Learn

bull ldigpy -m [model] --learning [corpus] -e [learning rate] -r [regularizer] --wr=[whole regularization]

ndash Learn the model using the corpus on 1 cycle of SGD

ndash -e learning rate of SGD

ndash -r regularizer of L1 regularization

ndash --wr what times to regularize for whole parameters

bull Parameters are too many to regularize the whle ones every step

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 59

Usage (3) Shrink Model

bull ldigpy -m [model] --shrink

ndash Remove Unefficient features(all

parameters of which are 0) from the

model

Usage (4) Detect Language

• ldig.py -m [model] [test data]

  – Detects the languages of the test data and outputs the results and a summary

Data Format

• Training and test data

  – [correct label]\t[meta data]\t[text]

en u should just enjoy ur vacation sadly
en D im online but you arent RT that much
en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
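
A record in this format could be read with a few lines of Python. The sketch below is an assumption-laden illustration, not ldig's loader: the name `parse_record` is hypothetical, the first TAB-separated field is taken as the label, the last one as the text, and everything in between as metadata, which presumes that the text itself contains no TAB.

```python
def parse_record(line):
    """Split one '[label]<TAB>[meta data]<TAB>[text]' line.

    Returns (label, metadata fields, text). Metadata may be empty
    or consist of several TAB-separated fields.
    """
    fields = line.rstrip("\n").split("\t")
    label = fields[0]
    text = fields[-1]
    metadata = fields[1:-1]
    return label, metadata, text
```

For example, `parse_record("en\t\tim online")` yields the label `en`, one empty metadata field, and the text `im online`.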

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]

  – Open http://localhost:[port] after starting it

  – For a text entered in the text area, it outputs the language probabilities, the features contained in the text, and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

Since a text whose maximum probability is < 0.6 is treated as undetectable, the total of "detect" is less than the total of "size".

    language     size   detect  correct  precision(%)  recall(%)  LD53(%)  LDsm(%)
ca  Catalan      5093     4923     4857         98.66      95.37     95.3     97.0
cs  Czech        7681     7668     7663         99.93      99.77     96.3     99.7
da  Danish       5516     5472     5310         97.04      96.27     94.5     92.4
de  German      10060    10069    10006         99.37      99.46     86.6     93.8
en  English     10162    10133    10029         98.97      98.69     88.3     95.0
es  Spanish     10244    10284    10120         98.41      98.79     91.5     96.0
fi  Finnish      7051     7038     7024         99.80      99.62     98.9     99.6
fr  French      10074    10134    10051         99.18      99.77     95.0     98.1
hu  Hungarian    4904     4892     4858         99.30      99.06     85.8     95.5
id  Indonesian  10178    10225    10160         99.36      99.82     89.7     98.9
it  Italian     10143    10205    10103         99.00      99.61     96.2     98.0
nl  Dutch       10005     9916     9858         99.42      98.53     69.5     97.4
no  Norwegian    8504     8432     8201         97.26      96.44     96.0     96.3
pl  Polish      10151    10149    10130         99.81      99.79     98.0     99.7
pt  Portuguese  10212    10201    10119         99.20      99.09     88.0     96.9
ro  Romanian     5913     5867     5850         99.71      98.93     92.8     97.4
sv  Swedish     10025    10093     9942         98.50      99.17     96.0     97.9
tr  Turkish     10308    10317    10298         99.82      99.90     97.6     99.5
vi  Vietnamese  10487    10480    10474         99.94      99.88     98.7     99.2

total: size 166711, correct 165053 (99.01%), LD53 92.2%, LDsm 97.4%
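
The per-language figures appear to follow precision = correct / detect and recall = correct / size; this reading is an inference that matches the rows of the table, not a definition given on the slide. A small sketch, with the hypothetical helper names `precision_recall` and `decide`, shows both the arithmetic and the 0.6 rejection rule:

```python
def precision_recall(size, detect, correct):
    """Return (precision, recall) in percent.

    size:    number of texts labeled with the language
    detect:  number of texts the detector assigned to the language
    correct: number of those assignments that were right
    """
    return 100.0 * correct / detect, 100.0 * correct / size


def decide(probs, threshold=0.6):
    """Pick the most probable language, or None if it is below threshold."""
    lang = max(probs, key=probs.get)
    return lang if probs[lang] >= threshold else None
```

For the Catalan row (size 5093, detect 4923, correct 4857) this gives a precision of about 98.66 and a recall of about 95.37.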

Estimation for LIGA dataset

• Evaluated using the LIGA [Tromp+ 11] dataset, which has 9066 tweets in 6 languages

  – http://www.win.tue.nl/~mpechen/projects/smm/

• Uses the 19-language model

    language   size  detect  correct  precision(%)  recall(%)
de  German     1479    1476     1469          99.5       99.3
en  English    1505    1502     1490          99.2       99.0
es  Spanish    1562    1548     1541          99.6       98.7
fr  French     1551    1549     1540          99.4       99.3
it  Italian    1539    1531     1528          99.8       99.3
nl  Dutch      1430    1429     1424          99.7       99.6

total: size 9066, correct 8992 (99.2%)

Estimation for Europarl Dataset

ldig is evaluated only on the languages it supports ("-" marks an unsupported language).

                            ldig             langdetect          CLD
    language     size   correct rate(%)   correct rate(%)   correct rate(%)
bg  Bulgarian    1000         -       -       988    98.8       991    99.1
cs  Czech        1000      1000   100.0       994    99.4       995    99.5
da  Danish       1000       976    97.6       968    96.8       932    93.2
de  German       1000       999    99.9       998    99.8      1000   100.0
el  Greek        1000         -       -      1000   100.0      1000   100.0
en  English      1000       999    99.9       996    99.6      1000   100.0
es  Spanish      1000      1000   100.0       996    99.6       989    98.9
et  Estonian     1000         -       -       996    99.6       998    99.8
fi  Finnish      1000       997    99.7       998    99.8      1000   100.0
fr  French       1000       999    99.9       999    99.9       992    99.2
hu  Hungarian    1000      1000   100.0       999    99.9       999    99.9
it  Italian      1000       999    99.9       999    99.9       996    99.6
lt  Lithuanian   1000         -       -       997    99.7       999    99.9
lv  Latvian      1000         -       -       999    99.9       998    99.8
nl  Dutch        1000      1000   100.0       974    97.4       995    99.5
pl  Polish       1000       998    99.8       999    99.9       997    99.7
pt  Portuguese   1000       995    99.5       996    99.6       989    98.9
ro  Romanian     1000      1000   100.0       999    99.9       998    99.8
sk  Slovak       1000         -       -       988    98.8       990    99.0
sl  Slovene      1000         -       -       976    97.6       963    96.3
sv  Swedish      1000       995    99.5       991    99.1       993    99.3

    total       21000     13957    99.7     20850    99.3     20814    99.1

Conclusions

• Built a language detector using a maximal substring model

  – Detects 19 languages with over 99% accuracy

  – Even langdetect reaches 97% accuracy when trained on the tweet corpus

• Precision should improve further if the corpus is maintained

  – There are still many mistakes (in particular da and no)

• Precision should improve further if metadata is added to the features

  – Open question: how to add and train metadata at low cost

• The model should be shrunk without loss of precision

  – It is too large for applications (> 100 MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 59: Short Text Language Detection with Infinity-Gram

Usage (4) Detect Language

bull ldigpy -m [model] [test data]

ndash Detect languages of test data and output

its result and summary

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 61

Data Format

bull Training and test data

ndash [correct label]yent[meta data]yent[text]

en u should just enjoy ur vacation sadly en D im online but you arent RT that much en im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

ca [status ID] [datetime] [userID] [language of UI] xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD

62

Usage (5) Estimation Tool

bull serverpy -m [model] -p [port number]

ndash Open httplocalhost[port] after it is executed

ndash Output their language probabilities contained features and their parameters for a text inputed in the text area

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 63

Estimation

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

LD53 = langdetect + standard bundled profiles LDsm = langdetect + profiles based on twitter corpus

As a text with maximum probability lt 06 is treated undetectablely the sum of detect is less than the sum of size

64

language size detect correct precision recall LD53 LDsmca Catalan 5093 4923 4857 9866 9537 953 970cs Czech 7681 7668 7663 9993 9977 963 997da Dannish 5516 5472 5310 9704 9627 945 924de German 10060 10069 10006 9937 9946 866 938en English 10162 10133 10029 9897 9869 883 950es Spanish 10244 10284 10120 9841 9879 915 960fi Finnish 7051 7038 7024 9980 9962 989 996fr French 10074 10134 10051 9918 9977 950 981hu Hungarian 4904 4892 4858 9930 9906 858 955id Indonesian 10178 10225 10160 9936 9982 897 989it Italian 10143 10205 10103 9900 9961 962 980nl Dutch 10005 9916 9858 9942 9853 695 974no Norwegian 8504 8432 8201 9726 9644 960 963pl Polish 10151 10149 10130 9981 9979 980 997pt Portuguese 10212 10201 10119 9920 9909 880 969ro Romanian 5913 5867 5850 9971 9893 928 974sv Swedish 10025 10093 9942 9850 9917 960 979tr Turkish 10308 10317 10298 9982 9990 976 995vi Vietnamese 10487 10480 10474 9994 9988 987 992

total 166711 165053 9901 922 974

Estimation for LIGA dataset

bull Estimate using LIGA[Tromp+ 11] dataset

with 9066 tweets for 6 languages

ndash httpwwwwintuenl~mpechenprojectssmm

Short Text Language Detection with Infinity-Gram

(NAIST Seminar)

Use 19 language model

65

Language size detect correct precision recallde German 1479 1476 1469 995 993en English 1505 1502 1490 992 990es Spanish 1562 1548 1541 996 987fr French 1551 1549 1540 994 993it Italian 1539 1531 1528 998 993nl Dutch 1430 1429 1424 997 996

total 9066 8992 992

Estimation for Europarl Dataset

Only supported languages for ldig

Short Text Language Detection with Infinity-Gram

(NAIST Seminar) 66

ldig langdetect CLDlanguage size correct rate correct rate correct rate

bg Bulgarian 1000 988 988 991 991cs Czech 1000 1000 1000 994 994 995 995da Dannish 1000 976 976 968 968 932 932de German 1000 999 999 998 998 1000 1000el Greek 1000 1000 1000 1000 1000en English 1000 999 999 996 996 1000 1000es Spanish 1000 1000 1000 996 996 989 989et Estonian 1000 996 996 998 998fi Finnish 1000 997 997 998 998 1000 1000fr French 1000 999 999 999 999 992 992hu Hungarian 1000 1000 1000 999 999 999 999it Italian 1000 999 999 999 999 996 996lt Lithuanian 1000 997 997 999 999lv Latvian 1000 999 999 998 998nl Dutch 1000 1000 1000 974 974 995 995pl Polish 1000 998 998 999 999 997 997pt Portuguese 1000 995 995 996 996 989 989ro Romanian 1000 1000 1000 999 999 998 998sk Slovak 1000 988 988 990 990sl Slovene 1000 976 976 963 963sv Swedish 1000 995 995 991 991 993 993

total 21000 13957 997 20850 993 20814 991

Conclusions

• Language detector using a maximal substring model

– Detects 19 languages with over 99% accuracy

– Even langdetect reaches 97% accuracy when its profiles are built from the tweet corpus

• If the corpus is maintained, precision will improve further

– There are still many mistakes (in particular for da and no)

• If metadata is added to the features, precision will improve further

– Open question: how to add and train on metadata at low cost

• Desire to shrink the model without loss of precision

– Too large for applications (> 100 MB)

References

• [Nakatani NLP12] Twitter Language Detection Using Maximal Substrings (in Japanese)

• [Okanohara+ 09] Text Categorization with All Substring Features

• [Brody+ 11] Cooooooooooooooollllllllllllll Using Word Lengthening to Detect Sentiment in Microblogs

• [Cavnar+ 94] N-Gram-Based Text Categorization

• [Tsuruoka+ 09] Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

Page 60: Short Text Language Detection with Infinity-Gram

Data Format

• Training and test data

– [correct label]\t[meta data]\t[text] (fields separated by tab characters)

en    u should just enjoy ur vacation sadly
en    D im online but you arent RT that much
en    im gettin attacked for a tweet LOOOOOOOOOOOOOOOOL

ca    [status ID]    [datetime]    [userID]    [language of UI]    xxx xDDD no mextranya Tal volta haguera segut millor per a la humanitat que no lhaguera vist you know xDD
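A minimal sketch of reading one line in this format. It assumes that fields are tab-separated, that the first field is the label and the last field is the text, and that any fields in between are metadata. The name `parse_line` and the sample metadata values are hypothetical and are not taken from ldig itself.

```python
from typing import List, NamedTuple


class Example(NamedTuple):
    label: str           # correct language label, e.g. "en"
    metadata: List[str]  # zero or more metadata fields
    text: str            # the text to classify


def parse_line(line: str) -> Example:
    """Split one tab-separated line into label, metadata, and text.

    Assumption: first field = label, last field = text,
    everything in between = metadata.
    """
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a label and a text field")
    return Example(label=fields[0], metadata=fields[1:-1], text=fields[-1])


if __name__ == "__main__":
    # Hypothetical sample line; the metadata values are placeholders.
    sample = "en\t12345\t2012-05-14\tu should just enjoy ur vacation sadly"
    print(parse_line(sample))
```

Taking the text from the last field lets the same parser handle lines with a single metadata field and lines with several, as in the ca example.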

Usage (5) Estimation Tool

• server.py -m [model] -p [port number]

– Open http://localhost:[port] once it is running

– For text entered in the text area, it outputs the language probabilities, the features the text contains, and their parameters

Estimation

LD53 = langdetect + standard bundled profiles
LDsm = langdetect + profiles based on the Twitter corpus

A text whose maximum probability is < 0.6 is treated as undetectable, so the sum of detect is less than the sum of size.

language        size    detect  correct  precision (%)  recall (%)  LD53 (%)  LDsm (%)
ca Catalan      5093    4923    4857     98.66          95.37       95.3      97.0
cs Czech        7681    7668    7663     99.93          99.77       96.3      99.7
da Danish       5516    5472    5310     97.04          96.27       94.5      92.4
de German       10060   10069   10006    99.37          99.46       86.6      93.8
en English      10162   10133   10029    98.97          98.69       88.3      95.0
es Spanish      10244   10284   10120    98.41          98.79       91.5      96.0
fi Finnish      7051    7038    7024     99.80          99.62       98.9      99.6
fr French       10074   10134   10051    99.18          99.77       95.0      98.1
hu Hungarian    4904    4892    4858     99.30          99.06       85.8      95.5
id Indonesian   10178   10225   10160    99.36          99.82       89.7      98.9
it Italian      10143   10205   10103    99.00          99.61       96.2      98.0
nl Dutch        10005   9916    9858     99.42          98.53       69.5      97.4
no Norwegian    8504    8432    8201     97.26          96.44       96.0      96.3
pl Polish       10151   10149   10130    99.81          99.79       98.0      99.7
pt Portuguese   10212   10201   10119    99.20          99.09       88.0      96.9
ro Romanian     5913    5867    5850     99.71          98.93       92.8      97.4
sv Swedish      10025   10093   9942     98.50          99.17       96.0      97.9
tr Turkish      10308   10317   10298    99.82          99.90       97.6      99.5
vi Vietnamese   10487   10480   10474    99.94          99.88       98.7      99.2

Total: size 166711, correct 165053, accuracy 99.01%; LD53 92.2%, LDsm 97.4%
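The figures in the table are consistent with precision = correct / detect and recall = correct / size (for Catalan, 4857 / 4923 ≈ 98.66% and 4857 / 5093 ≈ 95.37%). The following sketch shows how such columns could be computed with the 0.6 threshold. It assumes each result is a pair of the true label and a non-empty dictionary of language probabilities. The function name `evaluate` and the sample data are illustrative and are not part of ldig.

```python
from collections import Counter
from typing import Dict, List, Tuple

THRESHOLD = 0.6  # texts whose top probability is below this are undetectable


def evaluate(
    results: List[Tuple[str, Dict[str, float]]], threshold: float = THRESHOLD
) -> Dict[str, Dict[str, float]]:
    """Compute size, detect, correct, precision, and recall per language.

    results: (true_label, {language: probability}) pairs.
    """
    size, detect, correct = Counter(), Counter(), Counter()
    for true_label, probs in results:
        size[true_label] += 1
        predicted, p = max(probs.items(), key=lambda kv: kv[1])
        if p < threshold:
            continue  # undetectable: counted in size only
        detect[predicted] += 1
        if predicted == true_label:
            correct[predicted] += 1
    table = {}
    for lang in sorted(set(size) | set(detect)):
        table[lang] = {
            "size": size[lang],
            "detect": detect[lang],
            "correct": correct[lang],
            "precision": correct[lang] / detect[lang] if detect[lang] else 0.0,
            "recall": correct[lang] / size[lang] if size[lang] else 0.0,
        }
    return table


if __name__ == "__main__":
    # Made-up predictions for illustration only.
    sample = [
        ("da", {"da": 0.9, "no": 0.1}),
        ("da", {"da": 0.4, "no": 0.55}),  # below threshold, undetectable
        ("no", {"da": 0.7, "no": 0.3}),   # detected as da, incorrect
        ("no", {"no": 0.95, "da": 0.05}),
    ]
    for lang, row in evaluate(sample).items():
        print(lang, row)
```

Because detect counts texts by predicted language, a row's detect value can exceed its size (as for de and es in the table), even though the sum of detect stays below the sum of size.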
