
Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 553–561, Chiang Mai, Thailand, November 8–13, 2011. © 2011 AFNLP

Cross-domain Feature Selection for Language Identification

Marco Lui and Timothy Baldwin
NICTA VRL
Department of Computer Science and Software Engineering
University of Melbourne, VIC 3010, Australia
[email protected], [email protected]

Abstract

We show that transductive (cross-domain) learning is an important consideration in building a general-purpose language identification system, and develop a feature selection method that generalizes across domains. Our results demonstrate that our method provides improvements in transductive transfer learning for language identification. We provide an implementation of the method and show that our system is faster than popular standalone language identification systems, while maintaining competitive accuracy.

1 Introduction

Language identification (LangID) is the task of determining the language(s) that a text is written in. It is considered by some researchers to be a solved task, because previous research has reported near-perfect accuracy (Cavnar and Trenkle, 1994). Hughes et al. (2006) elaborated a number of simplifying assumptions that have made this the case, and Baldwin and Lui (2010a) showed that when some of these assumptions are relaxed to make the task closer to the actuality of open-web LangID, it becomes considerably harder. In this paper, we demonstrate that the style of evaluation used by Baldwin and Lui (2010a) performs well in-domain but badly cross-domain, and develop a novel method for preserving high in-domain accuracy while significantly boosting cross-domain accuracy. Similarly to Baldwin and Lui (2010a), we make the simplifying assumption that all documents are monolingual, despite recent work on multilingual LangID (Baldwin and Lui, 2010b).

LangID is usually formulated as a supervised machine learning problem and evaluated in an offline setting (Cavnar and Trenkle, 1994; Baldwin and Lui, 2010a). However, its primary use is online without any additional configuration, optimized for maximal cross-domain accuracy. A number of such standalone LangID systems are available, notable among which is TextCat (van Noord, 1997). TextCat has been the LangID solution of choice in research, and is the basis of language identification/filtering in the ClueWeb09 Dataset (Callan and Hoy, 2009) and CorpusBuilder (Ghani et al., 2004). Elsewhere, Google provides LangID as a web service via its Google Language Detect API (GoogleAPI). While it has much higher accuracy than TextCat (as we show in Section 6.1), research applications contravene the service's terms of use, and moreover the service is rate-limited.

Our ideal system should offer the same advantages as TextCat in terms of licensing and run-time speed, while matching the open-domain accuracy and ease-of-use of an API such as GoogleAPI. To this end, we release the optimized final LangID system described in this paper, and benchmark it against TextCat and GoogleAPI. Our major contributions are: (1) we show that negative transfer occurs when training a classifier over a combined set of LangID datasets; (2) we show that language models learned in a particular domain do not always generalize to other domains; (3) we develop a method for extracting features for LangID that are not tied to a particular domain, and show that our method mitigates negative transfer and provides improvements in cross-domain LangID; (4) we show that our method can incorporate additional languages without a penalty in performance on existing languages; and (5) we provide an implementation of our method that is faster than state-of-the-art LangID systems while maintaining competitive accuracy.

2 Background

LangID as a computational task is usually attributed to Gold (1967), who sought to investigate language learnability from a language theory perspective. However, its current form is much more recognizable in the work of Cavnar and Trenkle (1994), where the authors classified documents according to rank order statistics over byte n-grams between a document and a global language profile. Since the 1990s, LangID has been formulated as a supervised machine learning task, and has been greatly influenced by text categorization in general. Monolingual LangID of a test document D_i takes the form of a mapping onto a unique language from a closed set of languages C, i.e. I : D_i → c_i ∈ C.

Statistical approaches applied to LangID include the use of Markov models over n-gram frequencies (Dunning, 1994) and dot products of word frequency vectors (Darnashek, 1995). Kernel methods have also been applied to the task of LangID (Kruengkrai et al., 2005). Linguistically motivated models for LangID have also been proposed, such as stop word list overlap (Johnson, 1993), where a document is classified according to its overlap with lists for different languages. There has also been work on word and part of speech correlation (Grefenstette, 1995), cross-language tokenisation (Giguet, 1995) and grammatical-class models (Dueire Lins and Goncalves, 2004).

LangID has been applied in a variety of domains, including USENET messages (Cavnar and Trenkle, 1994), web pages (Kikui, 1996; Martins and Silva, 2005; Liu and Liang, 2008), and web search queries (Hammarstrom, 2007; Ceylan and Kim, 2009). It has been shown to improve performance of other tasks such as parsing (Alex et al., 2007) and multilingual text retrieval (McNamee and Mayfield, 2004). It has also been used for gathering data for linguistic corpus creation (Ghani et al., 2004; Baldwin et al., 2006; Xia et al., 2009; Xia and Lewis, 2009), and is an important area of research for supporting low-density languages (Hughes et al., 2006; Abney and Bird, 2010).

Transfer learning refers to the use of data from external domains to improve task performance on a target domain. Pan and Yang (2010) provide a survey, in which they define transductive transfer learning, where labels are available in source domain(s) but not in the target domain. This corresponds exactly to our task, where we wish to train a classifier using language-labelled data from a variety of sources, and apply this classifier to target data without making any assumptions about its domain.

Dataset        Docs    Langs   Doc Length (bytes)
JRC-ACQUIS     20000   22      18478.5 ± 60836.8
CLUEWEB09      20000   10      36909.0 ± 20735.2
DEBIAN         21735   89      12329.8 ± 30902.7
RCV2           20000   13       3382.7 ±  1671.8
WIKIPEDIA      20000   68       7531.3 ± 16522.2
N-EUROGOV       1500   10      17460.5 ± 39353.4
N-TCL           3174   60       2623.2 ±  3751.9
N-WIKIPEDIA     4963   67       1480.8 ±  4063.9

Table 1: Summary of the LangID datasets

Pan and Yang (2010) also discuss the phenomenon of negative transfer, whereby including data from the source domain(s) results in reduced performance in the target domain.

Other work has used the term domain adaptation to refer to inductive transfer learning, where labels are available in both the source and target domains, and the goal is to improve the performance in the target domain (Daume III and Marcu, 2006; Daume III, 2007). Daume III and Marcu (2006) tackle this problem by using a mixture model, where data in a specific domain is modelled as coming from a mixture of domain-specific and general components, and the linkage between domains is achieved by sharing a single general component. In our task, we have no labels in the target domain, and thus know nothing about the domain-specific component. Thus, our challenge is to build a suitable model of the general component, which we do by eliminating features that make minimal contribution to the general component. This approach makes the conversion of documents to a standardized representation much simpler than a model that decomposes individual features into general and domain-specific components.

3 Data Sources

For this work, we collected language-labelled development data from five sources of diverse origin, and use these as the basis of our examination of in-domain, inductive (all-domain) and transductive (cross-domain) learning. We additionally use three independent test data sets to validate the effectiveness of the final methodology. Statistics of all datasets are provided in Table 1.

3.1 Development data sets

JRC-ACQUIS: JRC-ACQUIS is an aligned multilingual parallel corpus (Steinberger et al., 2006) totalling 463792 documents in 22 languages. From the corpus, we randomly selected 20000 documents without replacement, maintaining the relative skew of languages. For each document, only the text enclosed in the <body> tags was retained.

CLUEWEB09: The ClueWeb09 dataset (Callan and Hoy, 2009) consists of about 1 billion web pages in 10 languages.[1] The language of each document was automatically detected using TextCat, an implementation of the algorithm of Cavnar and Trenkle (1994). We sampled 20000 instances from the dataset by selecting the first instance in each of 20000 files selected without replacement. Because the language labels are automatically assigned, they do not constitute a true gold-standard, and we thus use CLUEWEB09 exclusively as a training dataset in Section 6.

WIKIPEDIA: Wikimedia provides dumps of the complete contents of all Wikipedia wikis.[2] Individual languages have their own wiki, usually under the corresponding ISO 639-1 code. We obtained XML dumps of the wikis with valid ISO 639-1 codes. From these dumps, we selected 20000 pages in a skew-preserving fashion. In order to ensure that each language contained at least 20 documents, we limited selection to the 68 largest wikis by page count. The data we used was obtained from July–August 2010.

RCV2: Reuters RCV2[3] consists of over 487000 Reuters News stories in 13 languages. We randomly selected 20000 documents in a skew-preserving fashion.

DEBIAN: The Debian Project maintains manual translations of the content strings of a large number of software packages.[4] We obtained all translations with codes corresponding to valid ISO 639-1 codes. This resulted in 21735 language-package pairs in 89 languages.

For each dataset, we randomly divided the documents into two equal-sized partitions. One partition was used for selecting language features and for training, and the other was used for testing.

[1] http://boston.lti.cs.cmu.edu/clueweb09/
[2] http://dumps.wikimedia.org/backup-index.html
[3] http://trec.nist.gov/data/reuters/reuters.html
[4] http://i18n.debian.net/material/po/unstable/main/

To distinguish between them, we label the partitions A and B, respectively. For example, DEBIAN_A is the partition of the DEBIAN dataset used to compute feature weights and language models, and DEBIAN_B is the partition used for testing. We also make frequent use of the union of the A partitions across all datasets, and will refer to this as UNION_A.
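As a concrete illustration of the sampling and partitioning scheme described above, the following is a minimal sketch; the function names and data layout are our own, not taken from the released code:

```python
import random
from collections import defaultdict

def skew_preserving_sample(docs, langs, n):
    """Sample n documents without replacement, roughly preserving the
    relative frequency of each language in the source collection."""
    by_lang = defaultdict(list)
    for doc, lang in zip(docs, langs):
        by_lang[lang].append(doc)
    total = len(docs)
    sample = []
    for lang, items in by_lang.items():
        quota = max(1, round(n * len(items) / total))  # proportional quota per language
        for doc in random.sample(items, min(quota, len(items))):
            sample.append((doc, lang))
    random.shuffle(sample)
    return sample[:n]

def split_ab(pairs):
    """Randomly divide (doc, lang) pairs into two equal-sized partitions A and B."""
    pairs = list(pairs)
    random.shuffle(pairs)
    mid = len(pairs) // 2
    return pairs[:mid], pairs[mid:]
```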

3.2 Test data sets

In order to evaluate accuracy in the transductive learning setting, we make use of N-EUROGOV, N-TCL and N-WIKIPEDIA, the three datasets described in detail by Baldwin and Lui (2010a). N-EUROGOV was sourced from the EuroGOV collection used in the 2005 Web-CLEF task, N-TCL was manually sourced by the Thai Computational Linguistics Laboratory (TCL) in 2005 from online news sources, and N-WIKIPEDIA is drawn from a 2008 dump of Wikipedia, with normalization.

4 Learning Algorithms

In this work, we use a multinomial naive Bayes learner. For brevity, we only give a short sketch of the technique; it is described in much more detail by McCallum and Nigam (1998). The crux of the method is to compute the probability that an instance belongs to a class C_i from a given closed set C, and hence assign the most probable class to a document D, consisting of a vector of n features x_1 ... x_n:

$c = \operatorname{argmax}_{C_i \in C} P(C_i \mid D)$

Bayes’ theorem allows us to re-express this as:

$c = \operatorname{argmax}_{C_i \in C} \dfrac{P(D \mid C_i)\,P(C_i)}{P(D)}$

where P(D) is a normalizing constant independent of C_i. Thus for classification, we only need to estimate P(D|C_i) and P(C_i). C is modelled as a categorical distribution over classes, and so P(C_i) is obtained via a maximum likelihood estimate. In order to estimate P(D|C_i), we make the naive assumption that each term is conditionally independent, hence:

$P(D \mid C_j) = \prod_{i=1}^{n} \dfrac{P(t_i \mid C_j)^{N_{D,t_i}}}{N_{D,t_i}!}$

where N_{D,t_i} is the frequency with which term t_i occurs in D.
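To make the model concrete, here is a minimal, generic multinomial naive Bayes implementation over term counts (log-space, with add-one smoothing); it is an illustration of the formulas above, not the code of the released system. Note that the N_{D,t_i}! terms do not depend on the class, so they drop out when taking the argmax.

```python
import math
from collections import Counter

class SimpleMultinomialNB:
    """Minimal multinomial naive Bayes over term-count dictionaries."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha (Laplace) smoothing

    def fit(self, docs, labels):
        # docs: list of Counters mapping term -> count; labels: class label per doc
        self.classes = sorted(set(labels))
        n_docs = len(labels)
        class_freq = Counter(labels)
        self.log_prior = {c: math.log(class_freq[c] / n_docs) for c in self.classes}
        term_counts = {c: Counter() for c in self.classes}
        for doc, c in zip(docs, labels):
            term_counts[c].update(doc)
        self.vocab = {t for tc in term_counts.values() for t in tc}
        self.log_cond = {}
        for c in self.classes:
            total = sum(term_counts[c].values()) + self.alpha * len(self.vocab)
            self.log_cond[c] = {
                t: math.log((term_counts[c][t] + self.alpha) / total)
                for t in self.vocab
            }
        return self

    def predict(self, doc):
        # argmax_c [ log P(c) + sum_t N_{D,t} * log P(t|c) ]; the factorial
        # terms are constant across classes and so are omitted.
        best_class, best_score = None, float("-inf")
        for c in self.classes:
            score = self.log_prior[c]
            for t, count in doc.items():
                if t in self.log_cond[c]:  # ignore terms unseen in training
                    score += count * self.log_cond[c][t]
            if score > best_score:
                best_class, best_score = c, score
        return best_class
```

In our setting, the term counts are byte n-gram counts over the selected feature set, as described in Section 5.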


Figure 1: Scatter plot of IG for language vs. IG for domain (byte trigrams)

The reason we select multinomial naive Bayes is that it is relatively lightweight and has been shown to be highly accurate when combined with feature selection (McCallum and Nigam, 1998; Manning et al., 2008). To establish the generalizability of our results, we also experimented with a nearest prototype classifier based on skew divergence (Lee, 2001), based on the findings of Baldwin and Lui (2010a). However, when combined with feature selection, we found it was consistently outperformed by the naive Bayes classifier, and thus omit results from this paper. We also experimented with a linear-kernel SVM learner, but once again omit results as we found it to be comparable in accuracy to naive Bayes when combined with feature selection, but much slower to retrain.

5 Feature Selection

Figure 2: Scatter plot of DF vs. LD (byte trigrams)

The document representation that we use is a mixture of byte n-grams (Cavnar and Trenkle, 1994; Baldwin and Lui, 2010a), as it is language-neutral in that it does not make any assumptions about the language or language type of each document. In particular, we make no assumptions about word delimitation (e.g. via white space) in each language. Results from Baldwin and Lui (2010a) additionally suggest that explicit encoding detection is not necessary in LangID, and that the simple byte tokenization strategy also used in this work is superior to encoding-aware codepoint-based tokenization. Where we have data in more than one encoding, we do not transcode it; instead we simply extract byte features across all the encodings. In practice this is not an issue as the data that we use is mostly UTF8-encoded, with small quantities of other encodings (esp. in N-TCL). We make no further mention of encoding as it is not the focus of this work.
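A minimal sketch of byte n-gram extraction under this representation (our own illustration; the tokenizer in the released implementation may differ in detail):

```python
from collections import Counter

def byte_ngrams(document, max_order=4):
    """Count all byte n-grams of order 1..max_order in a document.
    The input is treated as a raw byte string: no transcoding, and no
    assumption about word delimitation in any particular language."""
    data = document.encode("utf-8") if isinstance(document, str) else document
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

# Example: mixed 1- and 2-grams over a short UTF-8 string.
print(byte_ngrams("día", max_order=2).most_common(3))
```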

Cavnar and Trenkle (1994) perform feature selection over such a mixture of n-grams, where 1 ≤ n ≤ 5. Their feature selection method is based on the frequency of terms, where the N most frequent terms for each language are retained in the global feature set. Related to term frequency is document frequency, where rather than counting individual terms in a class, we count the number of documents that a particular term appears in. Feature selection based on document frequency has been shown to be inferior to other methods. Generally it has been found that Information Gain (IG: Quinlan (1986)) is particularly suited to feature selection in multiclass problem settings such as LangID (Yang and Pedersen, 1997; Forman, 2003).

The novel aspect of our research is that we consider IG of particular n-gram features along multiple dimensions: (1) with respect to the set of all languages; (2) with respect to a given language; and (3) with respect to the domain the data was obtained from. We are interested in identifying features that have high IG with respect to language but low IG with respect to domain.

For each of 1-grams, 2-grams and 3-grams we computed the IG of each feature with respect to the set of all languages in our development datasets (97 languages), as well as with respect to the set of 5 domains. A scatter plot of IG for language vs. domain for 3-grams is presented in Figure 1. From this analysis, it is clearly visible that there are two distinct groups of features: one where the IG for language is strongly correlated with that for domain, and one where the IG for language is largely independent of that for domain. We are interested in identifying features in the second group, and so for each feature we compute an LD (Language-Domain) score, defined as:

$LD_{all}(t) = IG^{all}_{lang}(t) - IG_{domain}(t)$

The number of features to consider grows exponentially with n-gram order. This makes calculating the IG for higher token n-gram orders computationally infeasible. Yang and Pedersen (1997) found that document frequency (DF) and IG were strongly correlated. Thus, we studied the relationship between LD and DF in our data, and found that low DF is a good predictor of low LD score, but not vice versa, as seen in Figure 2. As DF is much cheaper to compute than IG, we first identify the 15000 features with highest DF for a given n-gram order, and assign an LD score of 0 to all features outside this set.
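The scoring itself can be sketched as follows, with information gain computed in the usual way as the reduction in label entropy when conditioning on the presence or absence of a feature in a document. The helper names are ours; in practice, as described above, only the 15000 highest-DF features per n-gram order would be scored.

```python
import math
from collections import Counter

def entropy(label_counts):
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in label_counts.values() if c)

def info_gain(labels, feature_present):
    """IG of a binary feature with respect to a label distribution:
    H(label) - H(label | feature present/absent)."""
    n = len(labels)
    with_f = Counter(l for l, p in zip(labels, feature_present) if p)
    without_f = Counter(l for l, p in zip(labels, feature_present) if not p)
    cond = (sum(with_f.values()) / n) * entropy(with_f) \
         + (sum(without_f.values()) / n) * entropy(without_f)
    return entropy(Counter(labels)) - cond

def ld_all(term, docs, langs, domains):
    """LD_all(t) = IG_lang(t) - IG_domain(t), over documents represented
    as term-count mappings (e.g. the byte n-gram counts above)."""
    present = [term in doc for doc in docs]
    return info_gain(langs, present) - info_gain(domains, present)
```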

We also consider an alternative formulation of LD. Due to skews between the quantities of data for each language, the highest-ranked features by LD tend to be biased towards the more common languages. In order to mitigate this, we also consider a formulation whereby we compute an LD_bin score for each feature conditioned on each language l:

$LD_{bin}(t \mid l) = IG^{bin}_{lang}(t \mid l) - IG_{domain}(t)$

In order to give better representation to less-frequent languages, we then select the top-scoring features from each language, and define the LD_bin feature set as the union of these features. The number of features N is a parameter in both methods that we will investigate empirically in Section 6.
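A sketch of the per-language selection, reusing the info_gain helper from the previous sketch (names are ours):

```python
def ld_bin_feature_set(candidate_terms, docs, langs, domains, n_per_lang=200):
    """Union over languages of the top n_per_lang terms ranked by
    LD_bin(t|l) = IG_bin_lang(t|l) - IG_domain(t), where the language
    label is binarized as "is language l" vs. "is not language l"."""
    selected = set()
    for lang in set(langs):
        binary_labels = [l == lang for l in langs]  # one-vs-rest binarization
        scored = []
        for term in candidate_terms:
            present = [term in doc for doc in docs]
            score = info_gain(binary_labels, present) - info_gain(domains, present)
            scored.append((score, term))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected.update(term for _, term in scored[:n_per_lang])
    return selected
```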

6 Experimental Setup and Results

The results presented in Baldwin and Lui (2010a) are based on cross-validation within datasets, without considering the applicability of a model learned in one domain to LangID in another domain. Baldwin and Lui (2010a) also presented preliminary results with regards to feature selection, based on the Cavnar and Trenkle (1994) method. We first compare results with and without feature selection using our LD_bin and LD_all metrics, as well as IG^all_lang, where we select the top features according to their information gain with the set of all languages, and finally IG^bin_lang, where we select a fixed number of features per language, according to their information gain with the language.

Baldwin and Lui (2010a) found that byte bigrams performed well as a feature set for LangID, so we use them for this initial experiment. Tokenizing across all of our datasets results in 57160 unique bigrams. For IG^all_lang and LD_all, we consider the top 2000 to 10000 features, in increments of 1000. For the per-language metrics, we experiment with selecting 100 to 800 features per language, in increments of 100.

We perform this experiment in two settings: (1) the supervised learning setting, where we use training and test data from the same domain; and (2) the inductive transfer learning setting, where we train a single classifier on UNION_A (the union of the A partitions across the five domains) and use this to classify partition B from each of the five domains in turn. We found that for IG^all_lang, 10000 features produced the best results; for IG^bin_lang, it was 100 features per class, corresponding to 4078 features; for LD_all, it was 3000 features; and for LD_bin it was 200 features per class, corresponding to 3086 features. We report the results for the best parametrization of each feature selection metric in Tables 2 and 3. In each case, we present the classification accuracy for the full feature set ("Full"), followed by the classification accuracy for feature selection over the same combination of training and test dataset, as the ∆ over Full. In all our experiments, the language skew present in each domain is preserved in both the training and the test partitions.
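Schematically, the inductive transfer setting amounts to the following loop; load_partition is a hypothetical loader returning term-count documents and their language labels, and SimpleMultinomialNB is the sketch from Section 4.

```python
train_sets = ["JRC-ACQUIS", "CLUEWEB09", "DEBIAN", "RCV2", "WIKIPEDIA"]
eval_sets = ["JRC-ACQUIS", "DEBIAN", "RCV2", "WIKIPEDIA"]  # CLUEWEB09 is training-only (Section 3.1)

# Train one classifier on UNION_A, the union of the A partitions.
train_docs, train_langs = [], []
for name in train_sets:
    docs, langs = load_partition(name, "A")  # hypothetical loader
    train_docs.extend(docs)
    train_langs.extend(langs)
clf = SimpleMultinomialNB().fit(train_docs, train_langs)

# Evaluate on the B partition of each domain in turn.
for name in eval_sets:
    docs, langs = load_partition(name, "B")
    accuracy = sum(clf.predict(d) == l for d, l in zip(docs, langs)) / len(langs)
    print(name, round(accuracy, 3))
```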

In the supervised case, the accuracy attained using the full feature set is similar to that of the respective reduced feature sets. This shows that utilizing the reduced feature sets does not harm in-domain classification accuracy, implying that we are able to preserve the key features required for classifying the data.

In the case of inductive transfer learning, we observe negative transfer: the greater amount of training data causes the accuracy to drop. This is particularly evident over RCV2, where adding in data from the other four domains results in a drop in accuracy from 0.973 to 0.576. We find that the LD features are generally able to better mitigate this than the IG_lang features.


Train           Eval            Full    IG^all_lang  IG^bin_lang  LD_all   LD_bin
JRC-ACQUIS_A    JRC-ACQUIS_B    0.985   +0.000       +0.007       +0.007   +0.005
DEBIAN_A        DEBIAN_B        0.963   +0.003       +0.002       −0.001   −0.016
RCV2_A          RCV2_B          0.973   −0.017       −0.016       −0.003   +0.026
WIKIPEDIA_A     WIKIPEDIA_B     0.935   +0.017       +0.020       +0.030   +0.001

Table 2: In-domain supervised learning accuracy, relative to all features ("Full")

Train      Eval         Full    IG^all_lang  IG^bin_lang  LD_all   LD_bin
UNION_A    JRC-ACQUIS   0.931   −0.001       +0.039       +0.060   +0.062
UNION_A    DEBIAN       0.817   −0.031       +0.049       +0.122   +0.127
UNION_A    RCV2         0.576   −0.048       +0.129       +0.347   +0.410
UNION_A    WIKIPEDIA    0.739   −0.150       −0.070       +0.179   +0.179

Table 3: Inductive transfer learning (all-domain) accuracy, relative to all features ("Full")

LD_bin features provide the best accuracy over each dataset in the inductive transfer learning setting, increasing accuracy over RCV2 to 0.986, exceeding the accuracy of the in-domain classification on the full feature set. We find that LD_bin features can also improve accuracy for in-domain classification, increasing accuracy on RCV2 to a near-perfect 0.999. These results validate the choice of LD_bin as a suitable metric for selecting general features for LangID.

Since our aim is to build a classifier in a transductive transfer learning setting, we examined the behaviour of the LD_bin features in such a setting over our 5 datasets. For each dataset, we trained a classifier on the A partition, and evaluated the classifier on the B partition of each of the other datasets. Since each dataset covers a different set of languages, there may be languages in the evaluation dataset that are not present in the training dataset. It makes no sense to attempt to classify documents in these languages since we will by definition misclassify them, so the results reported in Table 4 are only over languages in the evaluation set that are also present in the training set. As a result, caution must be exercised in naively comparing accuracy figures across different combinations of training and test datasets.

In Table 4, we see several examples of language models not generalizing to other domains. For example, we see that when using all features, the language models learned from RCV2 classify data from the RCV2 domain with accuracy 0.973, but this drops to 0.838, 0.889 and 0.796 in other domains. We also find that language models from other domains have reduced accuracy when classifying RCV2 data. We find that in all these cases, using the LD_bin features mitigates this loss in accuracy. On RCV2 in particular, use of the LD_bin features results in over 0.95 accuracy when training with any of the 5 domains.

There are a number of domains where the use of the LD_bin features results in a loss in accuracy relative to the full feature set. This contrasts with our results in the inductive learning setting. We hypothesize that this is due to the LD_bin features having been selected from the union of all 5 datasets. While this feature set represents a general model across these 5 domains, for a specific pair of domains there will be some additional features that are strong predictors of language.

Preliminary work by Baldwin and Lui (2010a) suggested that feature selection over a mixture of n-grams yielded promising results. This is also supported by Cavnar and Trenkle (1994), who use a mixture of 1- to 5-grams. We thus investigated the interaction between the range of n-gram orders used for feature selection, the number of features selected per language, and classification accuracy. Based on the results in Tables 2, 3 and 4, we used the LD_bin metric exclusively.

We conducted a parameter search for maximum token order M (e.g. M = 3 would mean all 1-, 2- and 3-grams) and number of features per language N. We considered 1 ≤ M ≤ 9 and 100 ≤ N ≤ 800 in increments of 100 features. We tested all combinations exhaustively in a supervised learning task across the union of all 5 datasets. We found that the best results are obtained for M ≥ 4 and N ≥ 300. In order to maximize classification rate we need to minimize the number of features used, so we chose M = 4 and N = 300, producing a feature set consisting of 7480 features.

In building a universal LangID tool, we need to quantify the effect of training on languages extraneous to the target domain. To investigate this, we perform experiments over the datasets of Baldwin and Lui (2010a) using two different classifiers: a reference classifier trained on all languages present in the training set, and a domain-specific classifier trained only on languages in the test set.


Training        Test Dataset
                JRC-ACQUIS          DEBIAN              RCV2                WIKIPEDIA
                Full      LD_bin    Full      LD_bin    Full      LD_bin    Full      LD_bin
JRC-ACQUIS      0.985     +0.005    0.979     −0.055    0.812     +0.174    0.977     −0.026
CLUEWEB09       0.981     −0.005    0.989     −0.010    0.925     +0.069    0.927     −0.039
DEBIAN          0.983     −0.003    0.963     −0.016    0.689     +0.261    0.937     −0.058
RCV2            0.838     +0.148    0.889     −0.003    0.973     +0.026    0.796     +0.154
WIKIPEDIA       0.837     +0.141    0.863     +0.011    0.561     +0.407    0.935     +0.001

Table 4: Transductive transfer learning (cross-domain) accuracy relative to all features ("Full")

Test Dataset    langid.py            TextCat               TextCat (retrained)   GoogleAPI
                Accuracy   docs/s    ∆Acc      Slowdown    ∆Acc      Slowdown    ∆Acc      Slowdown
JRC-ACQUIS      0.991      69.8      −0.164    18.5×       −0.075    11.1×       +0.004    7.0×
DEBIAN          0.969      94.1      −0.305    25.1×       −0.129    14.2×       −0.043    9.4×
RCV2            0.992      146.6     −0.642    34.9×       −0.922    19.1×       +0.005    14.6×
WIKIPEDIA       0.959      102.7     −0.210    26.2×       −0.365    14.9×       −0.012    10.3×
N-EUROGOV       0.987      68.5      −0.046    18.1×       −0.083    11.1×       +0.006    6.9×
N-TCL           0.904      172.1     −0.299    38.8×       −0.232    22.5×       +0.018    17.2×
N-WIKIPEDIA     0.913      209.2     −0.207    45.9×       −0.227    25.7×       −0.010    20.9×

Table 5: Comparison of standalone classification tools, in terms of accuracy and speed (documents/second), relative to langid.py

The reference classifier is trained on 1–4-grams, selecting 300 features per language on the full 97 languages, whereas each domain-specific classifier is trained only on the subset of languages present in the target domain.

Figure 3 shows a scatter plot of the per-language accuracy over the different datasets. Rather than harming performance, we find that adding languages extraneous to the target domain generally has no impact on accuracy over the languages in the dataset. In fact, it occasionally positively impacts on accuracy (points to the left of the diagonal), and for the rare instances where it hurts accuracy (points to the right of the diagonal), the difference is relatively modest.

6.1 Comparison to existing tools

Our ultimate interest is in building a standalone classifier that is fast and accurate. For comparison to existing tools, we implemented our method as a Python module (langid.py), in the form of a single Python file with pre-trained models for 97 languages. It can act as a standalone LangID system, an embedded Python module, or a web service with an AJAX API.[5]
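For example, once the module is available on the Python path, classifying a single string looks roughly like the following; the exact API and the form of the returned score may differ between versions of the released tool.

```python
import langid

# classify() returns a predicted ISO 639-1 code and a confidence score.
lang, score = langid.classify("Questo è un esempio di testo in italiano.")
print(lang)  # expected: 'it'
```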

We compared the speed and accuracy of our system to TextCat, as well as GoogleAPI. GoogleAPI is constrained to: (1) limit the classification rate to 10 documents/second, and (2) base the classification only on the first 500 bytes of the document. These constraints are imposed by the service's terms of use. For TextCat, we test it: (1) off-the-shelf using the pre-trained language models, and (2) after retraining it over the same training data as langid.py.

[5] The code is available for public download from http://www.csse.unimelb.edu.au/research/lt/resources/langid/

Table 5 shows the accuracy of each system across 7 test datasets, as well as the speed in documents per second. We present absolute accuracy and speed for langid.py, and relative accuracy and slowdown for the other classifiers. The machine used to perform this experiment was a commodity desktop-class machine, with an Intel Q9400 4-core CPU, 4GB of RAM and a 7200RPM SATA II hard drive. The slowdown for GoogleAPI is reported based on a classification rate of 10 documents per second.

We find that in general, langid.py is faster and more accurate than TextCat (both off-the-shelf and retrained). The difference is smallest over a "traditional" LangID dataset like N-EUROGOV which contains only 10 languages, and is much more pronounced over a dataset like N-TCL which has a much larger variety of languages. Comparing langid.py and GoogleAPI, the two systems are evenly matched in accuracy, but again, langid.py is significantly faster. All differences in accuracy are statistically significant (McNemar's test, p < 0.01).

We note that the accuracy of langid.py is slightly lower on N-TCL than on other datasets.


Figure 3: Per-language accuracy of a classifier trained only on languages present in the evaluation domain (x-axis) vs. all languages (y-axis)

Test dataset    TextCat   Feature selection
                          CT        LD_bin
N-EUROGOV       0.904     +0.067    +0.083
N-TCL           0.672     +0.016    +0.227
N-WIKIPEDIA     0.686     +0.084    +0.223

Table 6: Comparison of TextCat to a NB classifier using CT and LD_bin for feature selection, using UNION_A as the training data

N-TCL has a high proportion of non-UTF8 documents (57%). To quantify the effect of this, we transcoded all the documents in N-TCL to UTF8 and repeated the experiment. We found that after transcoding, the accuracy of langid.py increased from 0.904 to 0.947. This indicates that unsupported encodings have a small impact on accuracy, though it is relatively minor considering that the majority of the data is not UTF8-encoded.

There are two key conceptual differences between TextCat and langid.py: feature selection and learning algorithm. TextCat (CT) selects features per language by term frequency, whereas langid.py uses LD_bin (the focus of this work). On the other hand, the learning algorithm used by TextCat is a nearest-prototype method using the token rank difference metric of Cavnar and Trenkle (1994), whereas langid.py uses multinomial naive Bayes. In order to consider these differences independently, we combine the CT feature selection with the multinomial naive Bayes learner, selecting the union of the top-300 features per language by document frequency over 1- to 5-grams, yielding 10846 features. For comparison, we also selected the top 300 features by LD_bin over 1- to 5-grams, yielding 8166 features. We then trained a naive Bayes classifier over UNION_A using the respective feature sets, and used it to classify each of N-EUROGOV, N-TCL and N-WIKIPEDIA. In Table 6, we compare the accuracy of these two classifiers to (retrained) TextCat. Across all datasets, LD_bin is superior to CT. Additionally, NB with CT is superior to TextCat (which uses the same feature selection strategy, with the nearest prototype learner), although the difference here is much smaller. This suggests that the bulk of the improvement of langid.py over TextCat is due to the use of LD_bin, as developed in this research.
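For reference, the CT-style selection used in this comparison can be sketched as follows (our own illustration, selecting per-language features by document frequency and taking the union, as described above):

```python
from collections import Counter, defaultdict

def ct_feature_set(docs, langs, n_per_lang=300):
    """Union over languages of the n_per_lang terms with highest
    per-language document frequency."""
    df = defaultdict(Counter)
    for doc, lang in zip(docs, langs):
        df[lang].update(set(doc))  # count each term at most once per document
    selected = set()
    for lang, counts in df.items():
        selected.update(term for term, _ in counts.most_common(n_per_lang))
    return selected
```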

7 Conclusion

We demonstrated the problem of negative transfer in training a LangID system using language-labelled data from a variety of domains. We developed a method for identifying features that are strongly predictive of language across multiple domains by examining the difference in information gain of each feature with language and with the source domain. We used this method to compile a feature set from 50,000 documents in 97 languages across 5 datasets, and implemented this as a standalone LangID system using a naive Bayes classifier. We empirically compared our system to state-of-the-art LangID systems, and found our system to be faster whilst maintaining competitive accuracy.

Acknowledgments

The authors wish to thank Google for providing increased access to the Google AJAX Language Identification API for the purposes of this research. The first author was additionally a recipient of travel support from Google Australia.

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References

Steven Abney and Steven Bird. 2010. The human language project: building a Universal Corpus of the world's languages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 88–97, Uppsala, Sweden.

Beatrice Alex, Amit Dubey, and Frank Keller. 2007. Using foreign inclusion detection to improve parsing performance. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning 2007 (EMNLP-CoNLL 2007), pages 151–160, Prague, Czech Republic.

Timothy Baldwin and Marco Lui. 2010a. Language identification: The long and the short of the matter. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pages 229–237, Los Angeles, USA.

Timothy Baldwin and Marco Lui. 2010b. Multilingual language identification: ALTW 2010 shared task dataset. In Proceedings of the Australasian Language Technology Workshop 2010 (ALTW 2010), pages 5–7, Melbourne, Australia.

Timothy Baldwin, Steven Bird, and Baden Hughes. 2006. Collecting low-density language materials on the web. In Proceedings of the 12th Australasian Web Conference (AusWeb06). http://www.ausweb.scu.edu.au/ausweb06/edited/hughes/.

Jamie Callan and Mark Hoy. 2009. ClueWeb09 Dataset. Available at http://boston.lti.cs.cmu.edu/Data/clueweb09/.

William B. Cavnar and John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings of the Third Symposium on Document Analysis and Information Retrieval, Las Vegas, USA.

Hakan Ceylan and Yookyung Kim. 2009. Language identification of search engine queries. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1066–1074, Singapore.

Marc Darnashek. 1995. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267:843–848.

Hal Daume III and Daniel Marcu. 2006. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1):101–126.

Hal Daume III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 256–263, Prague, Czech Republic.

Rafael Dueire Lins and Paulo Goncalves. 2004. Automatic language identification of written texts. In Proceedings of the 2004 ACM Symposium on Applied Computing (SAC 2004), pages 1128–1133, Nicosia, Cyprus.

Ted Dunning. 1994. Statistical identification of language. Technical Report MCCS 940-273, Computing Research Laboratory, New Mexico State University.

George Forman. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(7-8):1289–1305.

Rayid Ghani, Rosie Jones, and Dunja Mladenic. 2004. Building minority language corpora by learning to generate web search queries. Knowledge and Information Systems, 7(1):56–83.

Emmanuel Giguet. 1995. Categorization according to language: A step toward combining linguistic knowledge and statistic learning. In Proceedings of the 4th International Workshop on Parsing Technologies (IWPT-1995), Prague, Czech Republic.

E. Mark Gold. 1967. Language identification in the limit. Information and Control, 5:447–474.

Gregory Grefenstette. 1995. Comparing two language identification schemes. In Proceedings of Analisi Statistica dei Dati Testuali (JADT), pages 263–268.

Harald Hammarstrom. 2007. A fine-grained model for language identification. In Proceedings of Improving Non English Web Searching (iNEWS07), pages 14–20.

Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew MacKinlay. 2006. Reconsidering language identification for written language resources. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pages 485–488, Genoa, Italy.

Stephen Johnson. 1993. Solving the problem of language recognition. Technical report, School of Computer Studies, University of Leeds.

Genitiro Kikui. 1996. Identifying the coding system and language of on-line documents on the internet. In Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), pages 652–657, Kyoto, Japan.

Canasai Kruengkrai, Prapass Srichaivattana, Virach Sornlertlamvanich, and Hitoshi Isahara. 2005. Language identification based on string kernels. In Proceedings of the 5th International Symposium on Communications and Information Technologies (ISCIT-2005), pages 896–899, Beijing, China.

Lillian Lee. 2001. On the effectiveness of the skew divergence for statistical language analysis. In Proceedings of Artificial Intelligence and Statistics 2001 (AISTATS 2001), pages 65–72, Key West, USA.

Jicheng Liu and Chunyan Liang. 2008. Text categorization of multilingual web pages in specific domain. In Proceedings of the 12th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD'08), pages 938–944, Osaka, Japan.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK.

Bruno Martins and Mario J. Silva. 2005. Language identification in web pages. In Proceedings of the 2005 ACM Symposium on Applied Computing (SAC '05), page 764.

Andrew McCallum and Kamal Nigam. 1998. A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, USA.

Paul McNamee and James Mayfield. 2004. Character n-gram tokenization for European language text retrieval. Information Retrieval, 7(1–2):73–97.

Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, October.

J. R. Quinlan. 1986. Induction of decision trees. Machine Learning, 1(1):81–106, October.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Daniel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.

Gertjan van Noord. 1997. TextCat. Software available at http://odur.let.rug.nl/~vannoord/TextCat/.

Fei Xia and William Lewis. 2009. Applying NLP technologies to the collection and enrichment of language data on the web to aid linguistic research. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education (LaTeCH – SHELT&R 2009), pages 51–59, Athens, Greece.

Fei Xia, William Lewis, and Hoifung Poon. 2009. Language ID in the context of harvesting language data off the web. In Proceedings of the 12th Conference of the EACL (EACL 2009), pages 870–878, Athens, Greece.

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. In Proceedings of the 14th International Conference on Machine Learning.