
Zhang X, Hao Y, Zhu XY et al. New information distance measure and its application in question answering system. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): 557-572 July 2008

New Information Distance Measure and Its Application in Question Answering System

Xian Zhang 1, Yu Hao 1, Xiao-Yan Zhu 1, and Ming Li 2

1 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada

    E-mail: [email protected]; [email protected]; [email protected]; [email protected]

    Received November 13, 2007; revised April 9, 2008.

Abstract In a question answering (QA) system, the fundamental problem is how to measure the distance between a question and an answer, hence ranking different answers. We demonstrate that such a distance can be precisely and mathematically defined. Not only is such a definition possible, it is actually provably better than any other feasible definitions. Not only is such an ultimate definition possible, but it can also be conveniently and fruitfully applied to construct a QA system. We have built such a system, QUANTA. Extensive experiments are conducted to justify the new theory.

    Keywords information distance, normalized information distance, question answering system

    1 Introduction

The Internet embodies every aspect of human knowledge, from science to art, from literature to history, from medicine to entertainment, from travel to shopping, from our daily lives to international politics, from the elite to the hoi polloi, from the origin of the universe to the end of it, usually truthful but sometimes misleading, all languages, all cultures, and all people.

Fundamentally, one of the obstacles that stand between us and this universal but unstructured knowledge is a metric that measures what piece of information is the closest to a proper answer to a question. This piece of information may be an article, a webpage, a paragraph, a phrase, or just a word, or an approximate occurrence of any of the above.

In the data mining community, some have argued that no such (universal) metric exists. Tan et al. [1] studied an exhaustive list of 21 measures and concluded that (among these measures) there is no measure that is consistently better than others in all application domains.

Over the past decade, it has been the goal of our (the last author and his colleagues) long-term effort to develop precisely such a metric of an information distance [2-4] that is consistently and provably better than all other reasonable metrics in all application domains. These include all metrics listed by Tan et al. [1] that satisfy distance metric requirements and when they are normalized to the range of 0 and 1 to be compared with the normalized information distance. See also [5]. This paper may be considered as a part of that continued effort.

The theory has been accepted and further studied by the theoretical community [6-12]. It has also been partially accepted by the data mining community [13]. Keogh, Lonardi, and Ratanamahatana [13] compared a variant of our approach in [3] to 51 measures from 7 data mining related conferences, including SIGKDD, SIGMOD, ICDM, ICDE, SSDB, VLDB, PKDD, PAKDD, and concluded that our information distance/compression based method was superior to all these parameter-laden methods on their benchmark data. In the meantime, the theory has found dozens of applications in many fields [14-31], from weather forecasting to software engineering, and to bioinformatics.

    A complete list of references is in the 3rd edition of [5].

In practical QA systems, when y contains a lot of information irrelevant to x, such information often overwhelms the information distance Dmax(x, y) = max{K(x|y), K(y|x)} and the normalized information distance dmax(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)} between x and y. It is desirable to develop a more suitable variant

    Regular Paper Corresponding Author

    This work is supported by the National Natural Science Foundation of China under Grant Nos. 60572084 and 60621062.


of information distance to deal with this partial information distance problem. In partial matching, the triangle inequality usually does not hold. The neighborhood density requirements in [2] usually do not hold, either. Often popular concepts on the Internet have an enormous number of neighbors. This calls for developing a new semi-distance for this situation. This distance is the min distance Dmin(x, y) = min{K(x|y), K(y|x)}, and dmin(x, y) = min{K(x|y), K(y|x)} / min{K(x), K(y)} [32,33].

This paper focuses on how to use dmax and dmin. In the past, the theory (dmax) has been approximated by text compression [3,4,12], or by statistics and Shannon-Fano code [17]. QA systems present a unique chance for us to unify these two and other approaches. We then describe a prototype QA system, QUANTA, based on our new theory. Extensive experiments show QUANTA has good potential to become a practical system.

    2 Preliminaries

    2.1 Kolmogorov Complexity

Kolmogorov complexity was introduced in the 1960s by R. Solomonoff, A.N. Kolmogorov and G. Chaitin [5]. It defines randomness of an individual (finite or infinite) string. Kolmogorov complexity has been widely accepted as an information theory for individual objects parallel to Shannon's information theory, which is defined on an ensemble of objects. It has also found many applications in computer science such as average-case analysis of algorithms [5]. Fix a universal Turing machine U. The Kolmogorov complexity of a binary string x conditioned on another binary string y, K_U(x|y), is the length of the shortest (prefix-free) program for U that outputs x with input y. It can be shown that for a different universal Turing machine U', for all x, y,

K_U'(x|y) = K_U(x|y) + C,

where the constant C depends only on U and U'. Thus we simply write K_U(x|y) as K(x|y). We write K(x|ε), where ε is the empty string, as K(x). For formal definitions and a comprehensive study of Kolmogorov complexity, see [5].

    2.2 Information Distance

In the classical Newton's world, we know how to measure physical distances. Today, we live in an information society. Can we similarly measure the informational distance in the cyber space between two objects: two documents, two letters, two emails, two music scores, two languages, two programs, two pictures, two systems, two genomes, or between a question and an answer? Such a measurement should not be application dependent. Just like in the classical world, we do not measure distances sometimes by the amount of time a bird flies and sometimes by the number of pebbles lining up on the Santa Barbara beach.

A good information distance metric should not only be application independent but also universally minorize all other reasonable definitions.

The task of a universal definition of information distance is elusive. Traditional distances such as the Euclidean distance or the Hamming distance obviously fail for even trivial examples. Tan et al. [1] have demonstrated that none of the 21 metrics used in the data mining community is universal. In fact, for any computable distance, we can always find counterexamples. Furthermore, when we wish to adopt a metric to be the universal standard of information distance, we must justify it. It should not come out of thin air. It should not be from a specific application. It should be as good as any definition for any application.

From a simple and accepted assumption in thermodynamics, over the last decade, we have derived such a universal information distance [2-4] and a general method to measure similarities between two sequences [3]. The theory has been initially applied to alignment-free whole genome phylogeny [3], chain letter history [34], language history [4,14], plagiarism detection [15], and more recently to music classification and clustering [16,18,31], the parameter-free data mining paradigm [13], Internet knowledge discovery [17], protein sequence classification [22], heart rhythm data analysis [28,35], and many more.

What would be a good departure point for defining an information distance between two objects? To answer this question, in the early 1990s, we studied the energy cost of conversion between two strings x and y [2]. Over half a century ago, John von Neumann hypothesized that performing 1 bit of information processing costs 1KT of energy, where K is Boltzmann's constant and T is the room temperature. Observing that reversible computations can be done for free, in the early 1960s Rolf Landauer revised von Neumann's proposal to hold only for irreversible computations. We [2] proposed to use the minimum energy needed to convert between x and y to define their distance, as it is an objective measure. Thus, if one wishes to erase string x, one can reversibly convert it to x*, x's shortest effective description, and then erase x*. Only the process of erasing |x*| bits is an irreversible computation. Carrying on from this line of thinking, we [2] defined the energy to convert between x and y as the smallest number of bits needed to convert from x to y and vice versa.


That is, with respect to a universal Turing machine U, the cost of conversion between x and y is:

E(x, y) = min{|p| : U(x, p) = y, U(y, p) = x}.   (1)

It is clear that E(x, y) ≤ K(x|y) + K(y|x). From this observation, and some other concerns, we have defined the sum distance [2]:

Dsum(x, y) = K(x|y) + K(y|x).

However, the following theorem [2] was a surprise.

Theorem 2.1. E(x, y) = max{K(x|y), K(y|x)}.

Thus, the max distance was defined [2]:

Dmax(x, y) = max{K(x|y), K(y|x)}.

Both distances are shown to satisfy the basic distance requirements such as positivity, symmetry, and the triangle inequality [2]. It was further shown that Dmax and Dsum minorize (up to constant factors) all other distances that are computable and satisfy a reasonable density condition that within distance k to any string x, there are at most 2^k strings. Formally, a distance D is admissible if

sum over y of 2^{-D(x,y)} ≤ 1.   (2)

Dmax(x, y) satisfies the above requirement because of Kraft's Inequality (with K being the prefix-free version of Kolmogorov complexity). It was proved [2] that for any admissible computable distance D, there is a constant c such that for all x, y,

Dmax(x, y) ≤ D(x, y) + c.   (3)

Putting it bluntly, if any such distance D discovers some similarity between x and y, so will Dmax.
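As a quick sanity check of the admissibility claim above, the Kraft's-Inequality argument can be written in one line (a sketch; it uses only the definition of Dmax and the fact that the shortest prefix-free programs witnessing K(y|x) satisfy Kraft's Inequality):

    \sum_{y} 2^{-D_{\max}(x,y)} \;\le\; \sum_{y} 2^{-K(y|x)} \;\le\; 1,

since Dmax(x, y) = max{K(x|y), K(y|x)} ≥ K(y|x) for every y.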

However, when we [3] tried to use the information distances Dsum and Dmax to measure similarity between genomes in 1998, we had a problem. E. coli and H. influenza are sister species but their genome lengths differ greatly. The E. coli genome is about 5 megabases whereas the H. influenza genome is only 1.8 megabases long. Dmax or Dsum between the two genomes is predominated by the genome length difference rather than the amount of information they share. Such a measure trivially classifies H. influenza to be closer to a more remote species of similar genome length such as A. fulgidus (2.18 megabases) than to E. coli.

In order to solve this problem, we introduced the shared information distance [3]:

dshare(x, y) = 1 - (K(x) - K(x|y)) / K(xy),

where K(x) - K(x|y) is the mutual information between sequences x and y [5]. We proved the basic distance metric requirements such as symmetry and the triangle inequality, and have demonstrated its successful application in whole genome phylogeny [3] and the evolutionary history of chain letters [34]. It turns out that dshare is equivalent to (using the Symmetry of Information Theorem, p.182, Theorem 2.8.2 in [5])

dsum(x, y) = (K(x|y) + K(y|x)) / K(xy).

Thus, it can be viewed as the normalized sum distance. Hence, it becomes natural to normalize the optimal max distance [4]:

dmax(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}.   (4)

The distance dmax(x, y) was called the normalized information distance and its metricity properties were proved similarly to those of the normalized sum distance [3].
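The K terms in (4) are not computable, but, as in [3, 4], they can be approximated by the output length of a real compressor. The following is a minimal sketch of such an approximation; the choice of zlib and the substitution K(x|y) ≈ C(yx) - C(y) are common practice in the compression-based literature, not a description of this paper's own implementation.

    import zlib

    def compressed_len(s: bytes) -> int:
        # Compressed length in bytes, used as a stand-in for K(s).
        return len(zlib.compress(s, 9))

    def d_max(x: bytes, y: bytes) -> float:
        # Compression-based approximation of the normalized information distance (4).
        kx, ky = compressed_len(x), compressed_len(y)
        kx_given_y = max(compressed_len(y + x) - ky, 0)   # ~ K(x|y)
        ky_given_x = max(compressed_len(x + y) - kx, 0)   # ~ K(y|x)
        return max(kx_given_y, ky_given_x) / max(kx, ky)

Such an approximation only behaves sensibly for inputs long enough for the compressor to find regularities; for short phrases the statistical approximations discussed in Section 3 are more appropriate.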

    2.3 Min Distance

We wish to develop a theory of information distance between two concepts or between a query and an answer word/phrase for the purpose of developing a QA system. The new theory needs to solve three practical problems that trouble the dmax(x, y) theory.

The first problem is how to remove the impact of irrelevant information. Consider the following QA example: Which city is Lake Washington by? (Question 1536 in the Appendix.) There are several cities around Lake Washington: Seattle, Kirkland, and Bellevue, which are all good answers. Although the most popular answer, Seattle, shares the most related information with Lake Washington, it also contains overwhelmingly irrelevant information not related to the lake. Thus the dmax measure tends to choose a city with higher complexity (lower probability) such as Bellevue. Can we remove the irrelevant information in a coherent theory and give the most popular city, Seattle, a chance?

The second problem is: should an information distance really satisfy the triangle inequality? Imagine you and a stranger sitting next to you on a plane find out you share a common friend, Joe Smith. The distance between you and the stranger suddenly gets much closer via Joe Smith! Consider a QA problem:


the concept of Marilyn Monroe is pretty far from the concept president. However, Marilyn Monroe is very close to JFK and JFK is very close to the concept of president. In the academic world, this phenomenon is reflected by the Erdős number. We all feel closely related via a third person, Paul Erdős. Think about your first date: did you cleverly find a conversation topic that subtly drew you and her closer?

An information distance must reflect what we think to be similar. And what we think to be similar apparently does not really satisfy the triangle inequality.

Fagin and Stockmeyer gave an example of partial pattern matching where the triangle inequality does not hold [36]. Veltkamp puts it vividly [37]: under partial matching, the distance between a man and a horse is larger than the sum of the distances between a man and a centaur and between a centaur and a horse, respectively. QA problems often depend on partial information, and are in a similar situation to partial pattern matching.

Some objects are popular, such as Seattle mentioned above, and they are close to many other concepts. To model this phenomenon properly is our third problem. We need to relax the neighborhood constraints of (2) to allow some selected (very few) elements to have much denser neighborhoods. In fact, in the Dsum(x, y) distance, we observed a similar phenomenon many years ago, and proved a theorem about tough guys having fewer neighbors in [5] (p.548, Theorem 8.3.8). Note that this third problem is closely related to the first problem: only when we allow a few popular objects to have very dense neighborhoods is it then possible that they are selected more often.

It turns out that all three problems can naturally be taken care of in one theory [32,33]. Let us go back to the starting point of information distance. In (1), we asked for the smallest number of bits that must be used to convert between x and y. Keeping the original motivation in mind: some information in x or y is not relevant to this conversion, and it can be kept aside in this conversion process. We thus define: with respect to a universal Turing machine U, the cost of conversion between x and y is:

Emin(x, y) = min{|p| : U(x, p, r) = y, U(y, p, q) = x, |p| + |q| + |r| ≤ E(x, y)}.   (5)

To interpret, the above definition separates r as the information for x and q as the information for y. Define Dmin(x, y) = Emin(x, y). In all of the following theorems and proofs, we omit O(log(|x| + |y|)) factors.

Theorem 2.2. Dmin(x, y) = min{K(x|y), K(y|x)}. See [33] for the full proof.

Observe the following interesting phenomena.

The extra information q in the proof of Theorem 2.2 contains no information about x; it is the irrelevant information in y, in a minimal sense.

While Dmin(x, y) is symmetric and positive, it can be easily proved that it does not satisfy the triangle inequality.

Dmin(x, y) satisfies (2) only for random x's. This is perhaps not a surprise as Dsum(x, y) = Dmin(x, y) + Dmax(x, y). In the new metric Dmin, good guys (Kolmogorov simple objects) have even more neighbors than in Dsum.

The above three properties naturally co-exist in the new theory. We can now normalize Dmin:

dmin(x, y) = min{K(x|y), K(y|x)} / min{K(x), K(y)}.   (6)

Observe again that dmin(x, y) is symmetric and positive, but it does not satisfy the triangle inequality [32].

While it is clear that Dmin(x, y) ≤ Dmax(x, y), it is not clear if such a relationship would hold for dmin vs. dmax. The following theorem shows that it holds after all.

Theorem 2.3. For all x, y, dmin(x, y) ≤ dmax(x, y). See [33] for the proof.

3 How to Use dmin and dmax

Both dmin and dmax are implemented in our QA system. dmax takes care of the more balanced matching situations and dmin is designed for partial information matches in imbalanced situations. dmin and dmax are not computable. They must be approximated heuristically. Since these distances are universal, theoretically all computable distances, including all those in [1], the compression methods in [3], and the frequency counts in [17], are legitimate approximations. It has been observed by many people that combining a strategy of searching more structured data with a statistical strategy of searching unstructured data is important [38-40]. The distances dmin and dmax now provide a unification of all these methods. By definition, dmin and dmax are neither statistical methods nor structured pattern matching methods; they are both, and more. They provide a natural way of combining these methods.

Words are concepts encoded according to some language. The concept "the top direction of the earth" is encoded as "north" with 5 letters in English, encoded as "nord" with 4 letters in French, and with "bei", 1 character, in Chinese. Now, let us forget about such a priori encoding. Let us treat each word (or phrase) on


the Internet as a placeholder for the corresponding concept. Given a query, we want to find an answer that has the smallest information distance (dmin or dmax) to this query. The K terms in dmin or dmax can be approximated in several ways, as follows.

1) We can use an approximate pattern matching with the query sentence, and obtain a short description, similar to [3].

2) We can use a frequency count, and use the Shannon-Fano code (p.67, Example 1.11.2 in [5]) to encode a phrase which occurs with probability p in approximately -log p bits to obtain a short description. This approximation method was first proposed by Cilibrasi and Vitányi [17].

3) Any of the methods in [1], after being normalized.

4) Mixed usage of 1)-3) above for the K terms in dmin or dmax, since we only need to choose the shortest encoding. No metric listed in [1] or in the literature can be naturally used this way.

Thus different methods of approximating dmax and dmin are unified under one roof of the information distance.
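To make the "shortest encoding wins" idea concrete, a minimal sketch follows. The compression-based estimator stands in for method 1) and the Shannon-Fano estimator for method 2); the probability table is an assumed input, and none of the names below come from the QUANTA implementation itself.

    import math
    import zlib

    def k_by_compression(phrase: str) -> float:
        # Method 1)-style estimate: length, in bits, of a compressed description.
        return 8.0 * len(zlib.compress(phrase.encode("utf-8"), 9))

    def k_by_shannon_fano(phrase: str, prob: dict) -> float:
        # Method 2)-style estimate: a phrase occurring with probability p
        # can be encoded in about -log2 p bits.
        p = prob.get(phrase, 0.0)
        return -math.log2(p) if p > 0.0 else math.inf

    def k_estimate(phrase: str, prob: dict) -> float:
        # Point 4): each K term independently takes whichever encoding is shortest.
        return min(k_by_compression(phrase), k_by_shannon_fano(phrase, prob))

Each K term in dmin or dmax can then be filled in by such an estimate independently of the others, which is exactly the mixing described in point 4).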

This might be a proper point to compare dmin and dmax with the existing measures listed in [1]. Dmin may be considered to be close to Confidence, max{P(B|A), P(A|B)}, in [1], if we insist on approximating the K terms by statistics. Neither dmin nor dmax corresponds to any measure listed in [1], even restricted to a pure probabilistic sense via Shannon-Fano encoding. In any case, other than being proved to be universal theoretically, the practical advantage of our normalized information distance measures is that they are neither pure statistical measures (like the 21 measures in [1]) nor pure pattern-based measures. They are both, and more, naturally. They are very easy to use, with a clear criterion: encoding length. This is particularly convenient for a QA system.

The measures dmax and dmin need to be extended to conditional versions to be used in our system. Define:

dmax(x, y|c) = max{K(x|y, c), K(y|x, c)} / max{K(x|c), K(y|c)},   (7)

dmin(x, y|c) = min{K(x|y, c), K(y|x, c)} / min{K(x|c), K(y|c)}.   (8)

Here c is the conditional sequence that is given for free to compute from x to y and from y to x.

This is important to a QA system. For example, for the polysemous word "fan", Table 1 shows the dmax distance from "fan" to "CPU" and from "fan" to "star", under three conditions: the empty condition, "temperature", and "Hollywood".

Table 1. Distances of the Polysemous Word "Fan"

    x     y      Condition      dmax
    Fan   CPU    (empty)        0.6076
    Fan   Star   (empty)        0.5832
    Fan   CPU    Temperature    0.3527
    Fan   Star   Temperature    0.6916
    Fan   CPU    Hollywood      0.8258
    Fan   Star   Hollywood      0.6598
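Entries such as those in Table 1 can be estimated from co-occurrence statistics in the spirit of the Shannon-Fano approach of [17]. The sketch below is only illustrative: hit_count is a hypothetical page-count lookup (e.g. from a search engine) and N is an assumed upper bound on the number of indexed pages; neither is specified by the paper, and the real system also mixes in alignment-based encodings.

    import math

    def code_len(term: str, condition: str, hit_count, N: float) -> float:
        # Shannon-Fano estimate of K(term | condition): -log2 P(term | condition).
        query = f"{term} {condition}".strip()
        p = hit_count(query) / max(hit_count(condition) if condition else N, 1.0)
        return -math.log2(max(p, 1.0 / N))

    def d_max_cond(x: str, y: str, c: str, hit_count, N: float = 1e10) -> float:
        # Count-based approximation of the conditional dmax in (7).
        kx = code_len(x, c, hit_count, N)
        ky = code_len(y, c, hit_count, N)
        kxy = code_len(f"{x} {y}", c, hit_count, N)
        kx_given_y = max(kxy - ky, 0.0)   # ~ K(x | y, c)
        ky_given_x = max(kxy - kx, 0.0)   # ~ K(y | x, c)
        return max(kx_given_y, ky_given_x) / max(kx, ky, 1e-9)

With a reasonable hit_count source, d_max_cond("fan", "CPU", "temperature", hit_count) would be expected to come out smaller than d_max_cond("fan", "star", "temperature", hit_count), mirroring the third and fourth rows of Table 1.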

All theorems proved for all variants of (normalized) information distance still hold under condition c. This seemingly innocent definition actually faces some complications in practice. For example, x, y, and c are sometimes not given as separate entities. The condition c is often mixed into x and y. That is, we are often only given some encodings of x conditioned on c and of y conditioned on c. At least we can show that for some reasonable encodings, the conditional versions of dmax and dmin are still well-defined, as the following theorem shows.

Theorem 3.1. For all sequences x, y, c, there exist x_c and y_c, where x_c is the shortest program computing x from c and y_c is the shortest program computing y from c, such that K(x|y, c) = K(x_c|y_c, c) modulo an additive O(1) term. Thus, up to an additive O(1) term,

dmax(x_c, y_c|c) = dmax(x, y|c);
dmin(x_c, y_c|c) = dmin(x, y|c).

Proof. In [10], Muchnik proved that there exist x_c and y_c such that K(x_c|x) = O(1), K(y_c|y) = O(1), K(x|x_c, c) = O(1) and K(y|y_c, c) = O(1). Then

K(x|y, c) ≤ K(x_c|y, c) + K(x|x_c, c) = K(x_c|y, c) + O(1)
          ≤ K(x_c|y_c, c) + K(y, c|y_c, c) + O(1)
          = K(x_c|y_c, c) + O(1).

At the same time,

K(x_c|y_c, c) ≤ K(x|y_c, c) + K(x_c|x) = K(x|y_c, c) + O(1)
              ≤ K(x|y, c) + K(y_c, c|y, c) + O(1)
              = K(x|y, c) + O(1).

Thus, K(x|y, c) = K(x_c|y_c, c) + O(1).

We know that the information distance is

universal [32]; that is, if x and y are close under any distance measure, they are close under the measure of information distance. However, it is not clear yet how to find out such closeness in traditional information distance theory. Now the conditional information distance provides a possible solution. As Table 1 shows, the conditional information distance reflects the variance of the relationship between concepts with different


knowledge (or condition) provided. Fig.1 gives a more interpretable explanation: the condition c could map the original concepts x and y into different x_c and y_c, and thus the varying closeness could be reflected by the distance between x_c and y_c, still under a certain condition c. However, this time the condition c is not so important since x_c and y_c should have eliminated all the information in c.

Fig.1. Conditional information distances under different conditions c.

In our QA system, we aim to calculate the relationship between the question and the answers under the information distance measure. Direct calculation of the unconditional distance is both difficult and inflexible, while we find it possible and convenient to estimate the conditional information distance between the focus of the question and the answers, under a certain context. As explained previously, different conditions lead to different distances, so with the most proper condition and the nearest distance, the best answer can be identified out of previously determined candidates. Obviously, the generation of the condition is the key part. We have built a carefully designed system in which proper conditions can be generated according to the original question, which will be introduced in detail in Section 4.

A note on Dmin and Dmax vs. dmin and dmax: the

choice of using the normalized information distances, dmin and dmax, was because they give more practical ranking. We have given one such example of genome-genome comparison in Subsection 2.2. Consider the QA example: Who is the greatest scientist of all? The statement "Newton is the greatest scientist of all" has 7 precise matches on the Internet indexed by Google. "Einstein is the greatest scientist of all" has 1 precise match, whereas "God is the greatest scientist of all" has 27 precise matches. The normalized answer is Newton, and the unnormalized answer is God.

    4 QA System

We have described a complete theory: from a single thermodynamic principle to the universal metrics dmin and dmax, and to methods of how they can be used. We are now ready to apply our theory to the development of a QA system.

Current search engines return a list of web pages, pushing the burden of finding the answer to the user. This is not satisfactory, as a user may not want to, or may be unable to (a cell phone user), and should not be required to go through the pages to look for an answer.

Despite their usefulness and successes, Wikipedia-type approaches and human-annotated knowledge bases have the following inherent disadvantages.

The knowledge is dynamic and rapidly increasing. Manually maintaining such a massive knowledge base systematically and consistently is difficult.

The authors writing the Wikipedia entries can be limited and biased.

Wikipedia may not cover up-to-date events and less popular topics immediately.

To overcome these problems, a QA system is expected. The goal is to extract succinct and correct answers from the Internet to a question given by an English sentence. An elegant system for this purpose is the ARANEA system by Lin and Katz [41]. Similar systems at LCC (QAS), Powerset.com, ask.com, and brainboost.com are under development.

Guided by the theory described in this paper, we have implemented a prototype QUestion-ANswer-TAvern (QUANTA) system. Initial experiments show the high potential of the QUANTA system. It answers many questions better than Google.com, ask.com, ARANEA, and START of MIT.

The idea of using web information is by no means new, and combining the strategies of searching structured data and unstructured data has also been studied previously (see [38-40] for example). What is new in our system is a natural integration of various measures under one roof: the information distance. The universal information distance gives a clean, natural, and provably best strategy to combine the ranks of the answers.

QUANTA consists of modules for preprocessing, query formulation, document and passage retrieval, candidate generation, and scoring to generate the final exact answer, as shown in Fig.2. We briefly describe the main modules below.


    Fig.2. QUANTA system architecture.

    4.1 Preprocessing

We first process the original question with the help of NLP techniques, heuristic methods, and classification approaches in the Preprocessing stage, including the Question normalization, Question type classification, Syntactic structure classification, POS tagging, NP chunking, and NER modules.

Question normalization transforms questions into simpler, regular forms. There can be many ways to phrase the same question, which would introduce more branches in later stages. Besides, the question may contain text that is obviously uninformative. To handle these cases, tens of rules have been induced from training samples for normalization. For example, questions like "Tell me what ..." are simply transformed to "What ...", and "Where is the location of ...", "In what place ..." as well as "Where's ..." are all normalized to "Where is ...".
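A minimal sketch of this kind of rule-based normalization follows; the handful of rules shown are only the examples given above, whereas the actual rules in QUANTA were induced from training samples.

    import re

    # Each rule rewrites a recognized question prefix into its canonical form.
    NORMALIZATION_RULES = [
        (re.compile(r"^tell me what\b", re.IGNORECASE), "What"),
        (re.compile(r"^where is the location of\b", re.IGNORECASE), "Where is"),
        (re.compile(r"^in what place\b", re.IGNORECASE), "Where is"),
        (re.compile(r"^where's\b", re.IGNORECASE), "Where is"),
    ]

    def normalize_question(question: str) -> str:
        question = question.strip()
        for pattern, canonical in NORMALIZATION_RULES:
            question = pattern.sub(canonical, question)
        return question

For instance, normalize_question("Where is the location of Lake Washington?") returns "Where is Lake Washington?".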

The Question type classification module determines the answer type of the question, which helps filter out many irrelevant candidates. We do the classification in a hybrid way: the obvious types are first picked out by pre-defined rules, and an SVM classifier, trained on the UIUC-5500 question collection [42] with the libsvm toolkit [43], is applied thereafter. Our question type taxonomy is derived from the UIUC taxonomy [42], with adaptations to fit the candidate filtering modules.

The Part-of-Speech (POS) tagging is done by SS Tagger [44]. However, the tagger is built mainly for normal natural language sentences, of which question forms make up only a small part, so we have to modify the tagger to be suitable for questions.

The Noun Phrase (NP) chunking module and the Named Entity Recognition (NER) module extract substructures in the question text, which are quite useful in the query formulation and condition generation modules. The basic NP chunker software in [45] is used for marking the noun phrases in the question, and the Stanford NER package [46] is selected for the NER module. This software recognizes names and then classifies them into Person, Organization, Location, and Misc categories. By keeping such a substructure as a whole, useful context information can improve the effectiveness of the corresponding modules.

Questions are also classified by their syntactic structures. We defined five syntactic categories as follows.

wh-word be sth.
Sample: What is the height of the tallest tree?

wh-word be done.
Sample: How many people were killed in the fire?

wh-word do sth.
Sample: Who killed Abraham Lincoln?

wh-word do subj do.
Sample: When did Wilt Chamberlain score 100 points in a game?

Other.

Usually, most questions are classified into the first four categories. Unfortunately, because of POS tagging errors, some sentences may find no proper category to be assigned to, which results in the Other category. The classification is done in two stages: rule-based pattern matching and SVM-based classification [41].

    4.2 Query Formulation

In the Query formulation stage, questions in natural language, after being processed by the POS/NP chunking/NER modules, are transformed into web queries which will be sent to search engines to find relevant documents on the Internet.

Search queries are either exact matches or approximate matches, with constraints that the terms in the query co-occur in the same sentence, in neighboring sentences or the same paragraph, or finally, in the loosest case, in the same document.

To find the relevant documents, using commercial Internet search engines is a good and common choice, considering their large-scale web data. But in the meantime, we lose much flexibility in the system. The queries are formulated with different strictness to


make maximal use of a commercial search engine. For example, for each question, the most strict query requires documents exactly containing most of the original text of the question, word by word, while a less strict query just looks for some important part of the question, and the most inexact query only requires that the words occur in the document, independently. In general, in QUANTA, preprocessing results are combined to generate different levels of queries. For example, when the pattern (wh-word be [noun chunk] verb+ed) is found, the exact queries will be generated as ([noun-chunk] be verb+ed) and (verb [noun-chunk]), the less strict queries are ([noun-chunk] be verb+ed), and the inexact query contains just all the informative words in the question. Fig.3 gives an example.

Different ranks are given to queries with different levels of strictness, and are then transferred to the corresponding search results later, which helps the rough ranking of the candidates. Less strict queries will lead to smaller scores.

Who was the first person to run the mile in less than four minutes?

Query 1 (Exact query): the first person to run the mile in less than four minutes

Query 2 (A less strict query): the first person to run the mile in less than four minutes

Query 3 (Inexact query): the first person to run the mile in less than four minutes

Fig.3. Queries for the question "Who was the first person to run the mile in less than four minutes?"
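The strictness levels in Fig.3 differ in how the query terms are grouped and quoted for the search engine (the quoting in the figure did not survive reproduction). The sketch below shows one plausible scheme; the two-chunk split and the numeric ranks are our assumptions, not QUANTA's exact rules.

    def formulate_queries(informative_words):
        # informative_words: the content words of the question, e.g. from NP chunking.
        phrase = " ".join(informative_words)
        mid = max(len(informative_words) // 2, 1)
        first = " ".join(informative_words[:mid])
        second = " ".join(informative_words[mid:])
        return [
            (3, f'"{phrase}"'),             # exact: the whole phrase must appear verbatim
            (2, f'"{first}" "{second}"'),   # less strict: two quoted chunks must co-occur
            (1, phrase),                    # inexact: the words may occur anywhere in the page
        ]

The rank attached to each query (3, 2, 1 here) is later transferred to the passages it retrieves, so that less strict queries contribute smaller scores, as described above.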

    4.3 Document and Passage Retrieval

Commercial search engines are used as the external resource, as previously explained. QUANTA currently uses commercial search engines (Google and AltaVista) as the document and passage retrieval engine. Experiments show that the two search engines give similar results.

The queries generated in Subsection 4.2 are sent to the search engines to retrieve at most the top 100 snippets.

These summary texts are extracted in the summary extraction module. HTML tags and useless text are eliminated, and the clean text of each snippet is treated as one passage, ranked according to its source query.

    4.4 Candidate Generation

In the N-gram extraction module, all possible n-grams in the retrieved passages are extracted as raw candidates, with n = 1, 2, 3, 4. The score of a candidate is initialized according to the source passage.

Apparently there will be a large number of raw candidates. Filters are implemented to remove most of these candidates in the Type filtering module. Firstly, duplicated items are merged with their scores summed up. Secondly, candidates in the stop word list, or that begin/end with certain stop words, are excluded. Finally, some heuristics are applied to take care of certain types of questions, as determined by the question type classification. Only candidates of the specified formats or enumerations, filtered by various patterns and dictionaries, are kept.

After the above processes, a simple combination module is applied to merge short candidates into long ones. For example, if "John Wilkes Booth" is found, then candidates such as "Booth" or "Wilkes Booth" with relatively low scores are integrated into the longer one, with the scores combined. However, this merger will not happen if the score of the long candidate is too low, which prevents candidates from becoming immoderately long.
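A sketch of this merging step is given below; the substring containment test and the score threshold are illustrative assumptions rather than QUANTA's exact rules.

    def merge_candidates(scores: dict, min_long_score: float = 1.0) -> dict:
        # Fold each short candidate's score into a longer candidate containing it,
        # unless the longer candidate's own score is too low to justify the merge.
        merged = dict(scores)
        for short in sorted(scores, key=len):
            if short not in merged:
                continue
            for longer in sorted(scores, key=len, reverse=True):
                if longer == short or short not in longer or longer not in merged:
                    continue
                if merged[longer] >= min_long_score:
                    merged[longer] += merged.pop(short)
                    break
        return merged

For instance, merge_candidates({"John Wilkes Booth": 2.0, "Wilkes Booth": 0.7, "Booth": 0.5}) folds the two shorter spans into the full name, giving it a combined score of 3.2.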

This part of QUANTA has benefited significantly from the ARANEA system of Lin and Katz [41]. With the above work done, reasonable answers will usually have high (but usually not the highest) scores. Now we can come to the final step: scoring the candidates with dmax and dmin to determine the final answer.

    4.5 Candidate Scoring

As explained in Section 3, the conditional information distances dmax(x, y|c) and dmin(x, y|c) are used as the scoring scheme:

dmax(x, y|c) = max{K(x|y, c), K(y|x, c)} / max{K(x|c), K(y|c)},

dmin(x, y|c) = min{K(x|y, c), K(y|x, c)} / min{K(x|c), K(y|c)}.

Now the work is to calculate the distance between a candidate and a certain reference object, under certain conditions:

dmax(candidate, reference | condition),
dmin(candidate, reference | condition).

Therefore, there are two tasks in this stage: 1) to determine the three items in the formulae above: a candidate, an appropriate reference object, and reasonable conditions, and 2) to estimate the K(·) complexity. Fig.4 gives an example of the conditional information distance calculation for the question What city is Lake Washington by?. For simplicity, only dmin is shown in the figure.


    Fig.4. Sample of conditional information distance calculation.

Candidate. 10-20 candidates from the previous stages are left for the final information distance scoring.

Reference Object. In our approach, we take the key named entity or noun phrase, usually the subject or the main object of the question sentence, as the reference object. It is usually simple to find it based on the natural language processing results from the Preprocessing stage. There are indeed some questions which have no proper named entity to be used as a reference object. For such questions we use the unconditional versions of dmax and dmin, calculating

d(question, candidate)

as the final score.

Condition. A flexible condition is constructed to

adapt to the different contexts of various questions. Table 2 gives some example patterns, where f indicates the position of the question focus, and c that of the candidate. In practical usage, the placeholders in the pattern will be replaced by a question focus or a candidate, and the result will be further translated into the query form acceptable to a certain search engine. This is actually a flexible local alignment of the condition to the sentence or paragraph that may contain the answer. The most strict conditions in Table 2 are exact sentence templates which define the texture of the sentence, including where the candidate or the question focus is, while the least strict condition in Table 2 is just some key words that must co-occur with the question focus or the candidate. This is similar to the query formulation part but is more complex.

    Table 2. Sample Condition Patterns

    f (was | were) killed (in | on) c (in | on) c , f (was | were) killed (in | on) c & f (was | were) killed (in | on) c & f & (was | were) killed in | on c f (was | were) killed

    Note: denotes an exact match in a web query, the, in expects an exact match, too.

The following is a brief description of the module, including the identification of the question focus, the generation of the condition patterns, and the calculation of the normalized information distance.

Splitting. Split the question into pieces of text, which could be wh-word phrases, preposition phrases, noun phrases, or verb phrases, etc., and which will be the components of the condition patterns, or the question focus.

Question Focus Selection. Select the most proper noun phrase as the question focus. This usually comes from the subject part or the only object part of the sentence. Replace it with the placeholder f.

Marking the Candidate Placeholder. Mark the wh-word phrase that presents the question. Replace it with the placeholder c.

Syntactic Transformation. Apply possible syntactic transformations to the question, and break the sentence by selectively adding or not adding quotation marks to organize the previously marked pieces, in order to generate patterns of different strictness levels. Different syntactic structures lead to different transformation methods. A rank is given to each condition pattern, and a higher rank reflects a nearer relationship between the question and the answer. Intuitively, the more modification is made from the original form, the lower the condition's rank is; in the meantime, the more strict the pattern is, the higher the rank is.

Distance Calculation. Replace the placeholders with the actual question focus and the candidates, and then calculate the conditional information distance according to (7) and (8).

The QUANTA system combines two ways of approximating K(·): (a) encoding via an approximate match of the question and answer sentences; (b) a Shannon-Fano code that encodes occurrence probabilities to approximate K(·); see [5] for details of the encoding process. The minimum encoding via (a) and (b) is used for each K(·) term in the dmax and dmin formulae. Item (a) requires an alignment and computing the encoding overhead. Item (b) involves


finding the probabilities of occurrences of variables x and y under condition c. The powers of a structured data search using alignment and of a statistical strategy of searching unstructured data are thus organically integrated.
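Putting the pieces of this section together, the final scoring step can be sketched as below. The policy of keeping, for each candidate, the smallest conditional distance over all generated condition patterns is our simplification; d_cond stands for any estimator of (7) or (8), such as the count-based one sketched in Section 3.

    def score_candidates(candidates, reference, conditions, d_cond):
        # candidates: answer strings from candidate generation.
        # reference: the question focus (or the whole question when no focus exists).
        # conditions: instantiated condition patterns, most strict first.
        # d_cond(x, y, c): an approximation of dmin or dmax under condition c.
        scored = []
        for cand in candidates:
            best = min(d_cond(cand, reference, c) for c in conditions)
            scored.append((best, cand))
        scored.sort()                      # smallest information distance first
        return [cand for _, cand in scored]

The first element of the returned list is reported as the final answer.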

    5 Experimental Results

The standard QA test collection [47] consists of 109 factoid questions, covering several domains including history, geography, physics, biology, economics, fashion knowledge, etc., which are presented in various question forms. The open source factoid QA system ARANEA (downloaded from Jimmy Lin's website in 2005) is used for comparison; it implements the most commonly used tf-idf algorithm for candidate scoring. The QA test data comes with a knowledge base which we have not used. Both ARANEA and QUANTA use the Internet directly. For ARANEA, we use Google as the search engine. For QUANTA, we first ran the experiment using the Google search engine. To prove the algorithm's independence of the search engine, we repeated the experiment using the AltaVista search engine.

The performance of our system compared with ARANEA is listed in Table 3. Some widely accepted evaluation measures, the top-1 answer percentage and the mean reciprocal rank (MRR), are used in Table 3. Here, MRR = (1/n) Σ_i (1/rank_i), in which 1/rank_i is 1 if the first correct answer occurs in the first position; 0.5 if it first occurs in the second position; 0.33 for the third, 0.25 for the fourth, 0.2 for the fifth, and 0 if none of the first five answers is correct.
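For reference, a direct transcription of this measure (ranks holds, for each question, the position of the first correct answer among the top five, or None if none of them is correct):

    def mean_reciprocal_rank(ranks) -> float:
        # rank 1 -> 1.0, rank 2 -> 0.5, ..., rank 5 -> 0.2; otherwise 0.
        total = sum(1.0 / r if r is not None and r <= 5 else 0.0 for r in ranks)
        return total / len(ranks)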

    Table 3. Performance Comparison

                        Top 1    MRR
    ARANEA              0.422    0.463
    QUANTA (Google)     0.697    0.722
    QUANTA (AltaVista)  0.661    0.699

In the Appendix, we provide the key elements in the conditional information distance calculation. The "Google condition" and "AltaVista condition" columns are the automatically generated conditions in the calculation through Google or AltaVista, respectively.

From the results, we can see that the QUANTA system, which takes advantage of the information distance, has several strong points.

Stable over the Search Engine. ARANEA has 46 correct top-1 answers. The top-1 answer count is 76 through the Google search engine and 72 through the AltaVista search engine, which do not differ much. So the system performance is stable when changing the search engine.

Stable over the Condition Pattern. There are 44 questions that use different conditions when the search engine is changed, while still yielding the correct answer. Different search engines lead to different preferences of condition patterns. However, the experimental results show that this does not matter very much.

Stable over the Question Focus. The question focus is not always proper in the Appendix, but the answer is mostly correct. When the focus is wrongly selected, the exact condition may fail. However, the required hit count can still be obtained when the condition pattern is loose enough.

Now we can see that our algorithm is stable and robust, not depending very much on the search engine, the condition pattern, or the question focus.

Consider the example provided at the beginning of the article, the question Which city is Lake Washington by? (Question 1536). dmax's answer is Bellevue, a correct answer, whereas dmin's answer is Seattle, a correct and more popular answer! For another example, for the question When was CERN founded?, dmax's answer is "52 years ago", a correct answer in 2006, whereas dmin's answer is "1954", which is more accurate and robust.

We also perform a comparison between the normalized (d) and unnormalized (D) information distances. We have used Dmax and Dmin for the same data set of 109 questions. The performance of Dmax and Dmin is about 8% worse than that of dmax and dmin.

    6 Discussions and Conclusions

Firstly we will discuss some questions.

What is the difference between information distance and the statistical measures listed in [1]? Can our information distances be simply converted to some probabilistic statements/metrics?

Besides being proved to be universal and the formulas being different, even in the approximation sense the terms in dmax and in dmin may be approximated by probabilities, but also by alignments, and different approximations can be independently applied to different terms in one formula. It is not one thing alone. The ultimate measure is simply description length. This naturally integrates different metrics suitable for searching structured data and unstructured data. Therefore our information distances certainly cannot be simply explained by any probabilistic statements/metrics.

What is the practical difference between dmax and dmin? This requires larger scale experimental studies.


Further study of this theory (for example, comparing the theory with all the other metrics listed in [1] in practice) will require significantly larger experiments.

Conclusions: we have further developed the theory of information distance and solved several theoretical problems. We have provided a framework so that such a theory can be fruitfully applied to QA systems. We have built the QUANTA system based on our new theory and demonstrated the potential and versatility of our system.

    References

[1] Tan P N, Kumar V, Srivastava J. Selecting the right interestingness measure for association patterns. In Proc. SIGKDD'02, Edmonton, Alberta, Canada, pp.32-44.

[2] Bennett C H, Gacs P, Li M, Vitányi P, Zurek W. Information distance. IEEE Trans. Inform. Theory (STOC'93), July 1998, 44(4): 1407-1423.

[3] Li M, Badger J, Chen X, Kwong S, Kearney P, Zhang H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 2001, 17(2): 149-154.

[4] Li M, Chen X, Li X, Ma B, Vitányi P. The similarity metric. IEEE Trans. Information Theory, 2004, 50(12): 3250-3264.

[5] Li M, Vitányi P. An Introduction to Kolmogorov Complexity and Its Applications. 2nd Edition, Springer-Verlag, 1997.

[6] Vyugin M V. Information distance and conditional complexities. Theoret. Comput. Sci., 2002, 271: 145-150.

[7] Vereshchagin N K, Vyugin M V. Independent minimum length programs to translate between given strings. Theoret. Comput. Sci., 2002, 271: 131-143.

[8] Shen A K, Vereshchagin N K. Logical operations and Kolmogorov complexity. Theoret. Comput. Sci., 2002, 271: 125-129.

[9] Muchnik An A, Vereshchagin N. Shannon entropy vs. Kolmogorov complexity. In Proc. First International Computer Science Symposium in Russia, CSR 2006, St. Petersburg, Russia, June 8-12, 2006, pp.281-291.

[10] Muchnik An A. Conditional complexity and codes. Theoretical Computer Science, 2002, 271(1): 97-109.

[11] Muchnik An A, Vereshchagin N K. Logical operations and Kolmogorov complexity II. In Proc. 16th Conf. Comput. Complexity, Chicago, USA, 2001, pp.256-265.

[12] Chernov A V, Muchnik An A, Romashchenko A E, Shen A K, Vereshchagin N K. Upper semi-lattice of binary strings with the relation "x is simple conditional to y". Theoret. Comput. Sci., 2002, 271: 69-95.

[13] Keogh E J, Lonardi S, Ratanamahatana C A. Towards parameter-free data mining. In Proc. KDD 2004, Seattle, WA, USA, pp.206-215.

[14] Benedetto D, Caglioti E, Loreto V. Language trees and zipping. Phys. Rev. Lett., 2002, 88(4): 048702.

[15] Chen X, Francia B, Li M, Mckinnon B, Seker A. Shared information and program plagiarism detection. IEEE Trans. Information Theory, July 2004, 50(7): 1545-1550.

[16] Cilibrasi R, Vitányi P M B, de Wolf R. Algorithmic clustering of music based on string compression. Comput. Music J., 2004, 28(4): 49-67.

[17] Cilibrasi R, Vitányi P M B. The Google similarity distance. IEEE Trans. Knowledge and Data Engineering, 2007, 19(3): 370-383.

[18] Cuturi M, Vert J P. The context-tree kernel for strings. Neural Networks, 2005, 18(4): 1111-1123.

[19] Emanuel K, Ravela S, Vivant E, Risi C. A combined statistical-deterministic approach of hurricane risk assessment. Manuscript, Program in Atmospheres, Oceans, and Climate, MIT, 2005.

[20] Kirk S R, Jenkins S. Information theory-based software metrics and obfuscation. J. Systems and Software, 2004, 72: 179-186.

[21] Kraskov A, Stögbauer H, Andrzejak R G, Grassberger P. Hierarchical clustering using mutual information. Europhys. Lett., 2005, 70(2): 278-284.

[22] Kocsor A, Kertesz-Farkas A, Kajan L, Pongor S. Application of compression-based distance measures to protein sequence classification: A methodology study. Bioinformatics, 2006, 22(4): 407-412.

[23] Krasnogor N, Pelta D A. Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics, 2004, 20(7): 1015-1021.

[24] Taha W, Crosby S, Swadi K. A new approach to data mining for software design. Manuscript, Rice University, 2006.

[25] Otu H H, Sayood K. A new sequence distance measure for phylogenetic tree construction. Bioinformatics, 2003, 19(6): 2122-2130.

[26] Pao H K, Case J. Computing entropy for ortholog detection. In Proc. Int. Conf. Comput. Intell., Dec. 17-19, 2004, pp.89-92.

[27] Parry D. Use of Kolmogorov distance identification of web page authorship, topic and domain. In Proc. Workshop on Open Source Web Inf. Retrieval, Compiegne, France, 2005, pp.47-50.

[28] Santos C C, Bernardes J, Vitányi P M B, Antunes L. Clustering fetal heart rate tracings by compression. In Proc. 19th IEEE Int. Symp. Computer-Based Medical Systems, Salt Lake City, Utah, June 22-23, 2006, pp.685-690.

[29] Arbuckle T, Balaban A, Peters D K, Lawford M. Software documents: Comparison and measurement. In Proc. SEKE 2007, Boston, USA, July 9-11, 2007, pp.740-748.

[30] Ane C, Sanderson M J. Missing the forest for the trees: Phylogenetic compression and its implications for inferring complex evolutionary histories. Systematic Biology, 2005, 54(1): 146-157.

[31] Cilibrasi R, Vitányi P M B. Clustering by compression. IEEE Trans. Inform. Theory, 2005, 51(4): 1523-1545.

[32] Zhang X, Hao Y, Zhu X, Li M. Information distance from a question to an answer. In Proc. 13th ACM SIGKDD, San Jose, California, USA, 2007, pp.874-883.

[33] Li M. Information distance and its applications. Int. J. Found. Comput. Sci., 2007, 18(4): 669-681.

[34] Bennett C H, Li M, Ma B. Chain letters and evolutionary histories. Scientific American, June 2003, feature article, 288(6): 76-81.

[35] Siebes A, Struzik Z. Complex data: Mining using patterns. In Proc. the ESF Exploratory Workshop on Pattern Detection and Discovery, London, 2002, pp.24-35.

[36] Fagin R, Stockmeyer L. Relaxing the triangle inequality in pattern matching. Int. J. Comput. Vision, 1998, 28(3): 219-231.

[37] Veltkamp R C. Shape matching: Similarity measures and algorithms. In Proc. Int. Conf. Shape Modeling Applications, Italy, invited talk, 2001, pp.188-197.

[38] Lin J. The web as a resource for question answering: Perspectives and challenges. In Proc. 3rd Int. Conf. Language Resources and Evaluation, Las Palmas, Spain, May 2002.

[39] Clarke C, Cormack G V, Kemkes G, Laszlo M, Lynam T R, Terra E L, Tilker P L. Statistical selection of exact answers


(multitext experiments for TREC 2002). Report, University of Waterloo, 2002.

[40] Cimiano P, Staab S. Learning by googling. ACM SIGKDD Explorations Newsletter, 2004, 6(2): 24-33.

[41] Lin J, Katz B. Question answering from the web using knowledge annotation and knowledge mining techniques. In Proc. 12th Int. CIKM, New Orleans, Louisiana, USA, 2003, pp.116-123.

[42] Li X, Roth D. Learning question classifiers. In Proc. COLING'02, Taipei, Taiwan, China, 2002, pp.556-562.

[43] Chang C C, Lin C J. LIBSVM: A library for support vector machines. 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[44] Tsuruoka Y, Tsujii J. Bidirectional inference with the easiest-first strategy for tagging sequence data. In Proc. HLT/EMNLP'05, Vancouver, October 2005, pp.467-474.

[45] Ramshaw L, Marcus M. Text chunking using transformation-based learning. In Proc. 3rd Workshop on Very Large Corpora, Cambridge, Massachusetts, USA, 1995, pp.82-94.

[46] Finkel J R, Grenager T, Manning C. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. 43rd Annual Meeting of ACL, Michigan, USA, 2005, pp.363-370.

[47] Lin J, Katz B. Building a reusable test collection for question answering. Journal of the American Society for Information Science and Technology, 2006, 57(7): 851-861.

Xian Zhang received his B.E. degree from Tsinghua University, China in 2001. He is a Ph.D. candidate in the Department of Computer Science & Technology, Tsinghua University, China. His research interests include text mining, question answering and web information extraction.

Yu Hao received his Ph.D. degree from Tsinghua University, China in 2005. He is now working in Fujitsu Research and Development Center, China. His research interests include text mining, knowledge acquirement and web information extraction.

Xiao-Yan Zhu is a professor and the Deputy Head of the State Key Lab of Intelligent Technology and Systems, Tsinghua University. She obtained the Bachelor's degree from the University of Science and Technology Beijing in 1982, the Master's degree from Kobe University in 1987, and the Ph.D. degree from Nagoya Institute of Technology, Japan in 1990. She has been teaching at Tsinghua University since 1993. Her research interests include pattern recognition, neural networks, machine learning, natural language processing and bioinformatics. She is a member of CCF.

Ming Li is a professor, ACM Fellow, and Canada Research Chair, Tier I, in the School of Computer Science, University of Waterloo. He received his Ph.D. degree in 1985 from Cornell University. His research interests include bioinformatics, Kolmogorov complexity and its applications, computational learning theory, computational complexity, and design and analysis of algorithms.

    Appendix

Table A lists the automatically generated condition patterns and question focuses in the conditional information distance calculation, on Jimmy Lin's QA test collection, using either Google or AltaVista as the search engine. We only list the questions for which the correct answer is achieved by at least one search engine.
