
    70 COMMUNICATIONS OF THE ACM | JULY 2013 | VOL. 56 | NO. 7

    contributed articles

Voice input is a major requirement for practical question answering (QA) systems designed for smartphones. Speech-recognition technologies are not fully practical, however, due to fundamental problems (such as a noisy environment, speaker diversity, and errors in speech). Here, we define the information distance between a speech-recognition result and a meaningful query from which we can reconstruct the intended query, implementing this framework in our RSVP system.

In 12 test cases covering male, female, child, adult, native, and non-native English speakers, each with 57 to 300 questions from an independent test set of 300 questions, RSVP on average reduced the number of errors by 16% for native speakers and by 30% for non-native speakers over the best-known speech-recognition software. The idea was then extended to translation in the QA domain.

In our project, which is supported by Canada's International Development Research Centre (http://www.idrc.ca/), we built a voice-enabled cross-language QA search engine for cellphone users in the developing world. Using voice input, a QA system would be a convenient tool for people who do not write, for people with impaired vision, and for children who might wish their Talking Tom or R2-D2 really could talk.

The quality of today's speech-recognition technologies, exemplified by systems from Google, Microsoft, and Nuance, does not fully meet such needs for several reasons:

Noisy environments in common audio situations;1

Speech variations, as in, say, adults vs. children, native speakers vs. non-native speakers, and female vs. male, especially when individual voice-input training is not possible, as in our case; and

Incorrect and incomplete sentences; even customized speech-recognition systems would fail due to coughing, breaks, corrections, and the inability to distinguish between, say, "sailfish" and "sale fish."

Information Distance Between What I Said and What It Heard

DOI:10.1145/2483852.2483869

The RSVP voice-recognition search engine improves speech recognition and translation accuracy in question answering.

BY YANG TANG, DI WANG, JING BAI, XIAOYAN ZHU, AND MING LI

key insights

Focusing on an infinite but highly structured domain (such as QA), we significantly improve general-purpose speech recognition results and general-purpose translation results.

Assembling a large amount of Internet data is key to helping us achieve these goals; in the highly structured QA domain, we collected millions of human-asked questions covering 99% of question types.

RSVP development is guided by a theory involving information distance.

Speech-recognition systems can be trained for a fixed command set of up to 10,000 items, a paradigm that does not work for general speech recognition. We consider a new paradigm: speech recognition limited to the QA domain covering an unlimited number of questions; hence it cannot be trained as a fixed command-set domain. However, our QA domain is highly structured and reflects clear patterns.

We use information on the Internet in the QA domain to find all possible question patterns, then use it to correct the queries that are partially recognized by speech-recognition software. We collected more than 35 million questions from the Internet, aiming to use them to infer templates or patterns to reconstruct the original intended question from the speech-recognition input. Despite this, we must still address several questions:

How do we know if an input question from the speech-recognition system is indeed the original user's question?;

How do we know if a question in our database is the user's intended question?;

Should we trust the input or the database?; and

Often, neither the database nor the input is always exactly right, so can we reconstruct the original question?

We provide a mathematical framework to address these questions and implement the related RSVP system. Our experiments show RSVP significantly improves current speech-recognition systems in the QA domain.

    Related Work

Speech recognition1 has made significant progress over the past 30 years since the introduction of statistical methods and hidden Markov models. Many effective algorithms have been developed, including the EM algorithm, the Baum-Welch algorithm, Viterbi N-best search, and N-gram language models trained on large data corpora. However, as explored by Baker et al.,1 automatic speech recognition is still an unsolved problem.

Unlike traditional speech-recognition research, we propose a different paradigm in the QA domain. In it, we are able to collect a very large pure text corpus (no voice) that differs from the fixed command-set domain where it is possible to train up to, say, 10,000 commands. The QA domain is unbounded, and the number of existing questions on QA websites exceeds 100 million, yet with very low coverage of all possible questions. These texts can be clustered into patterns. Here, we demonstrate that these patterns have 99% coverage of all possible question types, suggesting we can use them to improve speech-recognition software in this domain.

Previous research suggested that context information,21 the knowledge base,16,20 and conceptual relationships15 all can help address this.

[Figure: sample query "Who is the mayor?"]


Information Distance

Here, we develop the mathematical framework on which we designed our system. To define Kolmogorov complexity (invented in the 1960s), we start by fixing a universal Turing machine U. The Kolmogorov complexity of a binary string x, given another binary string y, KU(x|y), is the length of the shortest (prefix-free) program for U that outputs x with input y. Since it can be shown that for a different universal Turing machine U′, the metric differs by only a constant, we write K(x|y) instead of KU(x|y). We write K(x|ε), where ε is the empty string, as K(x). We call a string x random if K(x) ≥ |x|. See Li and Vitányi14 for more on Kolmogorov complexity and its rich applications.

Note K(x) defines the amount of information in x. What would be a good departure point for defining an information distance between two objects? In the 1990s, Bennett et al.2 studied the energy cost of conversion between two strings, x and y. John von Neumann hypothesized that performing 1 bit of information processing costs 1kT of energy, where k is Boltzmann's constant and T is the room temperature. In the 1960s, observing that reversible computations could be done at no cost, Rolf Landauer revised von Neumann's proposal to hold only for irreversible computations. Starting from this von Neumann-Landauer principle, Bennett et al.2 proposed using the minimum number of bits needed to convert x to y and vice versa to define their distance. Formally, with respect to a universal Turing machine U, the cost of conversion between x and y is defined as

E(x,y) = min{|p| : U(x,p) = y, U(y,p) = x}. (1)

It is clear that E(x,y) ≤ K(x|y) + K(y|x). Bennett et al.2 obtained the following optimal result, modulo log(|x| + |y|):

Theorem 1. E(x,y) = max{K(x|y), K(y|x)}.

Thus, we define the information distance between two sequences, x and y, as

Dmax(x,y) = max{K(x|y), K(y|x)}.

This distance Dmax was shown to satisfy the basic distance requirements (such as positivity, symmetricity, and triangle inequality). Furthermore, Dmax is universal in the sense that it always minorizes any other reasonable computable distance metric.

This concept, and its normalized versions, were applied to whole-genome phylogeny,12 chain-letter-evolution history,3 plagiarism detection,4 other phylogeny studies,13 music classification,5 and parameter-free data mining,10 and has been followed by many other applications (for topics mentioned here that do not have references, see Li and Vitányi14), including protein-structure comparison, heart-rhythm data analysis, QA systems, clustering, multiword-expression linguistic analysis, software evolution and engineering, software metrics and obfuscation, webpage authorship, topic, and domain identification, phylogenetic reconstruction, SVM kernels for string classification, ortholog detection,19 analyzing worms and network traffic, image similarity, Internet knowledge discovery,6 multi-document summarization, network structure and dynamic behavior,17 and gene expression dynamics in macrophage.18

Despite its useful properties and applications, the max distance Dmax(x,y) involves several problems when only partial matches are considered,8,22 where the triangle inequality fails to hold and irrelevant information must be removed. Thus, Li et al.11 introduced a complementary information-distance metric to resolve these problems. In Equation 1 we determine the smallest number of bits that must be used to reversibly convert between x and y. To remove the irrelevant information from x or y, we thus define, with respect to a universal Turing machine U, the cost of conversion between x and y as

Emin(x,y) = min{|p| : U(x,p,r) = y, U(y,p,q) = x, |p| + |q| + |r| ≤ E(x,y)}. (2)

This definition separates r from x and q from y. Modulo an O(log(|x| + |y|)) additive term, the following theorem was proved in Li11:

Theorem 2. Emin(x,y) = min{K(x|y), K(y|x)}.

We can thus define Dmin(x,y) = Emin(x,y) as a complementary information-distance metric that disregards irrelevant information. Dmin is obviously symmetric but does not satisfy the triangle inequality. Note that Dmin was used in the QUANTA QA system to deal with concepts that are more popular than others.23,24
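Kolmogorov complexity is uncomputable, so the compression-based applications cited above approximate K with a real compressor. The following is a minimal sketch of that standard approximation, using zlib; the conditional-complexity estimate via concatenation is a common heuristic, not this paper's implementation:

```python
import zlib

def C(s: str) -> int:
    """Approximate K(s) by the compressed length of s, in bits."""
    return 8 * len(zlib.compress(s.encode("utf-8"), 9))

def cond(x: str, y: str) -> int:
    """Crude stand-in for K(x|y): extra bits needed for x once y is known."""
    return max(C(y + x) - C(y), 0)

def d_max(x: str, y: str) -> int:
    """Approximate Dmax(x,y) = max{K(x|y), K(y|x)}."""
    return max(cond(x, y), cond(y, x))

def d_min(x: str, y: str) -> int:
    """Approximate Dmin(x,y) = min{K(x|y), K(y|x)}."""
    return min(cond(x, y), cond(y, x))

a = "who is the mayor of waterloo ontario"
b = "who is the mayor of toronto ontario"
c = "how does the atom bomb work"
# Near-duplicate questions come out closer than unrelated ones.
assert d_max(a, b) < d_max(a, c)
```

The compressor's constant overhead largely cancels in the subtraction, which is why this rough estimate still ranks similar strings as closer.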

Min-Max Distance

Now we formulate our problem in the frame of information distance, given a question database Q and k input questions from a speech-recognition system, as in I = {q1, ..., qk} (k ≤ 3 for the Google speech-recognition server in our experiments, and k = 1 in our translation application). The goal is to compute the user's intended question q. It could be one of the qi's; it could be a combination of all k of them; and it could also be one of the questions in Q that is close to some parts of the qi's.

We wish to find the most plausible question q, such that q fits one of the question patterns in Q, and q is close to I. We assume Q contains almost all question patterns; later, we provide an experimental justification for this claim.

We can thus formulate our problem as: Given Q and I, find q such that it minimizes the sum of distances from Q to q and from q to I, as in

D(Q, q) + D(q, I).

Here, Q is a huge database of 35 million questions asked by users. We assume q is similar to one of them; for example, a QA user might ask "Who is the mayor of Waterloo, Ontario?", but Q might include such questions as "Who is the mayor of Toronto, Ontario?" and "Who is the mayor of Washington, D.C.?" I sometimes contains questions like "Hole is the mayor?" and "Who mayor off Waterloo" from the speech-recognition software. Since Q is so large, the Dmax measure does not make sense here, as most of the information in Q is irrelevant. It is natural to use Dmin(q,Q) here. For the distance between q and I, we use Dmax(q,I) to measure it. Given I and Q, we wish to find q that minimizes the function

λDmin(q,Q) + Dmax(I,q), (3)

where Dmin measures the information distance between q and Q with irrelevant information removed, and Dmax is the information distance between q and I. We know

Dmin(x,y) = min{K(x|y), K(y|x)},
Dmax(x,y) = max{K(x|y), K(y|x)}.

Thus, Dmin(q,Q) = K(q|Q), because Q is very large and q is a single question. Note that λ is a coefficient that determines how much weight we wish to give to a correct template or pattern in Q. Equation 3 thus becomes

λK(q|Q) + max{K(q|I), K(I|q)}. (4)

Observations: We need λ > 1, so q = I does not minimize Equation 4. If λ is too large, then a bare pattern that ignores I might minimize Equation 4. There is a trade-off: Sometimes a less-popular pattern (taking more bits in the Dmin term) might fit I better (taking fewer bits in the Dmax term), and a more popular pattern (taking fewer bits in the Dmin term) might miss one or two key words in I, taking more bits to encode in the Dmax term. Note that λ is optimized for the trade-off.
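The trade-off can be made concrete with a toy calculation. The bit costs below are invented purely for illustration, and the placement of λ on the pattern term follows the text's description of λ as the weight on a correct template:

```python
# Hypothetical bit costs for two candidate reconstructions of the noisy
# input I = "hole is the mayor".
#   K_q_Q: cost of encoding q via a pattern in Q (popular pattern = cheaper)
#   D_I_q: cost of converting between q and the recognizer output I
candidates = {
    "who is the mayor of waterloo": {"K_q_Q": 18, "D_I_q": 6},  # fits a pattern, small edit
    "hole is the mayor":            {"K_q_Q": 35, "D_I_q": 0},  # q = I: no edit, poor pattern fit
}

def objective(costs, lam=1.5):
    """Equation 4 (sketch): lam * K(q|Q) + Dmax(I, q), with lam > 1."""
    return lam * costs["K_q_Q"] + costs["D_I_q"]

best = min(candidates, key=lambda q: objective(candidates[q]))
# With lam > 1 the pattern-fitting candidate wins over copying the input verbatim.
```

Usage: `objective` scores each candidate and `min` picks the reconstruction; with these toy numbers, `best` is the pattern-fitting question rather than the raw recognizer output.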

Encoding Issues

To minimize Equation 4, we solve three problems:

Encode q using Q in the first term; it is a problem to encode an item with respect to a big set;

Encode q using I or encode I using q, and take whichever is larger in the second term; and

Find all possible candidate questions q and the one that minimizes Equation 4.

We see that Q is very large and contains different types of questions. For each such type, we extract one or more question templates. In this way, Q can be viewed as a set of templates, with each template, denoted as p, covering a subset of questions from Q. When encoding q, we need not encode q from Q directly. Instead, we encode q with respect to the patterns or templates of Q; for example, if a pattern p appears N times in Q, then we use log2(|Q|/N) bits to encode the index for this pattern. Given pattern p, we encode q with p by encoding their word mismatches. There is a trade-off between the encoding of p and the encoding of q given p. A common pattern may be encoded with a few bits but also may require more bits to encode a specific question using the pattern; for example, the template "who is the mayor of CityName" requires more bits to encode than the template "who is the mayor of Noun" because the former is a smaller class than the latter. However, the first template requires fewer bits to generate the question "who is the mayor of Waterloo" since it requires fewer bits to encode "Waterloo" from the class CityName than from the class Noun.
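This two-part code length can be sketched directly; the class frequencies below are made up for illustration (the true pattern counts in Q are not published here), and `pattern_cost` and `slot_cost` are illustrative names:

```python
from math import log2

Q_SIZE = 35_000_000  # size of the question database Q (from the article)

def pattern_cost(n_covered: int) -> float:
    """Bits to index a pattern covering n_covered of the |Q| questions: log2(|Q|/N)."""
    return log2(Q_SIZE / n_covered)

def slot_cost(word: str, word_class: dict) -> float:
    """Bits to pick a specific filler from a class, given class frequencies."""
    total = sum(word_class.values())
    return -log2(word_class[word] / total)

# Made-up frequency counts: CityName is a small class, Noun a bigger superset.
city_name = {"Toronto": 500, "Waterloo": 40, "Washington": 300}
noun = dict(city_name, **{"dog": 900, "bridge": 700, "history": 800})

# The specific template covers fewer questions (costlier to index) but makes
# the slot cheaper to fill, and vice versa for the general one.
specific = pattern_cost(20_000) + slot_cost("Waterloo", city_name)
general = pattern_cost(200_000) + slot_cost("Waterloo", noun)
```

The two totals embody the trade-off described above: which template wins depends on how much of the cost sits in the index versus the slot.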

The patterns could be extracted by pre-processing or dynamically according to the input. In practice, we extract patterns only from questions relevant to I, denoted as Q′. We organize Q′ hierarchically. Similar questions are mapped to a cluster, and similar clusters are mapped to a bigger cluster. We extract one pattern from each cluster using a multiple-alignment algorithm. This pattern should be as specific as possible while at the same time covering all questions in the cluster. Note that the higher the cluster in the hierarchical structure, the more general the pattern. Our hierarchical clustering algorithm thus assures we can extract all possible patterns from relevant questions. We make use of numerous semantic and syntactic information sources during the process, including a POS tagger, named-entity recognition, WordNet, and Wikipedia. For example, given a cluster of three questions:

"Who is the mayor of Toronto?";
"Who is the president of the United States?"; and
"Who is a senator from New York?"

we could extract one pattern, such as "Who is the Leader of Location?" "Mayor," "president," and "senator" are all mapped to the Leader class, while "Toronto," "United States," and "New York" all belong to the Location class.

If we treat pattern p as a sentence, the problem of item-to-set encoding becomes item-to-item encoding, as in the computation of K(q|I) and K(I|q). To convert a sentence from another sentence, we need encode only the word mismatches and the missing words. The best alignment between two sentences is found through a standard dynamic-programming algorithm. We encode a missing word by the negative logarithm of its probability to appear at the given location and encode the mismatches by


calculating their semantic and morphology similarities. It requires fewer bits to encode between synonyms than antonyms.

Equation 4 must select the candidate question q, for which we use two strategies:

Offline. We cluster questions in Q and generate patterns offline, finding the most likely pattern, then generate q that is close to the input and to one of the patterns; and

Online. We consider only the question candidates relevant to the input that could be matched by at least one of our templates generated from a few questions and that share some keywords with the input in Q.

Finally, we choose the best q that minimizes Equation 4. Furthermore, we apply a bigram language model to filter questions with low trustworthiness. The language model is trained on our background question set Q. The λ value is trained as a part of our experiments. In our system the λ value is a function of the lengths of the input questions.

We have thus implemented RSVP, which, given speech-recognition input {q1, q2, q3}, finds q such that Equation 4 is minimized.

Implementation Details

The following are some RSVP implementation details:

Step 1. Analyze input:

Split the questions into words with the Stanford POS Tagger; at the same time, named entities are extracted from the input questions using Stanford NER, LingPipe NER, and YAGO; and

Find the best alignments among the input questions through dynamic programming. Words or named entities with similar pronunciations are mapped together; for example, given the three questions "whole is the mayor of Waterloo," "hole is the mayor of Water," and "whole was the mayor of Water," the best word alignment would look like this:

Whole is  the mayor of Waterloo
Hole  is  the mayor of Water
Whole was the mayor of Water

Step 2. Improve input questions:

Build a question based on the word-alignment results from Step 1; for each aligned word block, we choose one word to appear in the result;

We assume that a well-formed question should contain a wh-word (what, who, which, whose, whom, when, where, why, or how) or an auxiliary verb (be, have, do, will (would), shall (should), may (might), must, need, dare, or ought). If the inputs do not contain any such word, we add proper words into the question candidates; and

Since some correct words may not appear in the input, we further expand the question candidates with homonym dictionaries and metaphone dictionaries.

Step 3. Analyze relevant database patterns:

Find relevant database questions, sorting them based on their semantic and syntactic similarity to the improved input questions from Step 2;

If a question is almost the same as one of the input questions, we return that input question directly, and no further steps are done in this case;

The database questions involve many forms and patterns. We group similar questions together through a hierarchical clustering algorithm. The distance between two clusters is calculated based on the syntactic similarity between questions. The algorithm stops when the minimum distance between clusters reaches a predefined threshold;

When the relevant questions are grouped into clusters, we are able to extract patterns from each cluster. Following the algorithm outlined in Step 1, we align questions in each group, using their semantic similarities to encode the word distance. Then a group of questions is converted into a single list with multiple word blocks, with each block containing several alternative words from different questions; for example, given the questions "Who is the mayor of New York," "Who is the president of United States," and "Which person is the leader of Toronto," we obtain a list of word blocks after alignment:

{who, who, which person}, {is}, {the}, {mayor, leader, president}, {of}, {New York, United States, Toronto}; and

For each aligned word block, we further extract tags that would best describe the slot; here, YAGO9 is used to describe the meaning of each word or phrase. We extract the several most-common facts as the description of each word block. We then obtain one or several semantic patterns composed of words and facts from YAGO.
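The pairwise building block of such an alignment is the standard dynamic-programming word alignment the article mentions. A minimal Needleman-Wunsch-style sketch with unit gap and mismatch costs (the real system weights mismatches by semantic and phonetic similarity):

```python
def align(a, b, gap=1, mismatch=1):
    """Word-level alignment of two token lists via edit-distance DP.
    Returns aligned word pairs, with None marking a gap."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    # Trace back to recover the aligned word blocks.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch):
            pairs.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]

q1 = "who is the mayor of New York".split()
q2 = "who is the president of the United States".split()
blocks = align(q1, q2)  # e.g. pairs ("mayor", "president") and a gap for "the"
```

Columns of matched and substituted words from such pairwise alignments are what get merged into the word blocks shown above.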

Step 4. Generate the candidate questions:

Map the original input questions into the patterns we extracted from the database and replace the words in the patterns with the words from the input. Many candidate questions could be generated by considering the various combinations of word replacements; and

To reduce complexity, we train a bigram language model from our question set, removing candidate questions with low probability.
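A minimal add-one-smoothed bigram model illustrates this filtering step; the tiny corpus here is invented, whereas the real model is trained on the 35-million-question set:

```python
from collections import Counter
from math import log

def train_bigram(questions):
    """Unigram and bigram counts over tokenized questions, with sentence markers."""
    uni, bi = Counter(), Counter()
    for q in questions:
        toks = ["<s>"] + q.lower().split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def logprob(q, uni, bi):
    """Add-one-smoothed bigram log-probability of a candidate question."""
    toks = ["<s>"] + q.lower().split() + ["</s>"]
    V = len(uni)
    return sum(log((bi[(a, b)] + 1) / (uni[a] + V)) for a, b in zip(toks, toks[1:]))

corpus = ["who is the mayor of toronto", "who is the president of france",
          "who is the mayor of waterloo", "how does the atom bomb work"]
uni, bi = train_bigram(corpus)

# A fluent candidate scores higher than a garbled one of the same length,
# so low-probability candidates can be dropped by thresholding.
good = logprob("who is the mayor of ottawa", uni, bi)
bad = logprob("hole is the mayor off water", uni, bi)
```

Thresholding on such scores prunes the combinatorial explosion of word replacements before the information-distance ranking in Step 5.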

Step 5. Rank candidate questions using information distance:

Calculate the distances between the candidate questions and the input questions, K(q|I) and K(I|q). We align the candidate and input questions and encode the word mismatches and missing words, encoding a missing word through the negative logarithm of its probability to appear at the said


locations, and calculating word mismatches through their semantic, morphology, and metaphone similarities;

Calculate the distance between the candidate questions and the patterns, K(q|p). A method similar to the previous step is used to calculate the distances between questions and patterns; and

RSVP ranks all the candidates using Equation 4.

Step 6. Return the candidate with the minimum information-distance score as the final result. To improve speed, the last three items of Step 3 may be performed offline on the complete database Q.

Completeness of the Database Q

We tested the hypothesis that Q contains almost all common question types. The test set T contained 300 questions, selected (with the criteria of no more than 11 words or 65 letters, one question in a sentence, and no non-English letters) from an independent Microsoft QA set at http://research.microsoft.com/en-us/downloads/88c0021c-328a-4148-a158-a42d7331c6cf. We found that all but three have corresponding patterns in Q. Only three questions lacked strictly similar patterns in Q: "Why is some sand white, some brown, and some black?" "Do flying squirrels fly or do they just glide?" and "Was there ever a movement to abolish the electoral college?" We will provide the data set T upon request.

Experiments

Our experiments aimed to test RSVP's ability to correct speech-recognition errors in the QA domain, focusing on non-native speakers, as there are three non-native English speakers for each native English speaker in the world. Here, we further test and justify our proposed methodology by extending it to translation in the QA domain.

Experiment setup. We initially (in 2011) used the Nuance speech-recognition server and later switched to Google speech recognition (http://google.com), because the Google server has no daily quota and responds quicker. The RSVP system is implemented in a client-server architecture. The experiments were performed at a computer terminal with a microphone. The experimenter would read a question, and Google speech recognition would return three options. RSVP uses the three questions as input and computes the most likely question.

Dataset. We use the set T described in the previous section. T contains 300 questions. T was chosen independently, and T ∩ Q = ∅. Not all questions in T were used by each speaker in the experiments; non-native speakers and children skipped sentences that contain difficult-to-pronounce words, and less-proficient English speakers tend to skip more questions.

Time complexity. On a server with four cores, 2.8GHz per core, and 4GB memory, RSVP typically uses approximately 500ms to correct one question; that is, the speaker reads a question into a microphone, Google voice recognition returns three questions, and RSVP uses the questions as input, taking approximately half a second to output one final question.

Human speaker volunteers. Such experiments are complex and time consuming. We tried our best to remove individual speaker variance by having different people perform the experiments independently. We recruited 14 human volunteers, including native and non-native English speakers, adults and children, and females and males (see Table 1) during the period 2011 to 2012.

We performed 12 sets of experiments involving 14 different speakers, all using the same test set T or a subset of T. Due to children's naturally short attention spans, the three native English-speaking children (two males, one female) completed one set of experiments (experiment 7), each responsible for 100 questions. A non-native-speaking female child, age 12, performed the test (experiment 10) independently but was able to finish only 57 questions.

In the following paragraphs, CC signifies that the speech-recognition software (from Google) returned the correct answer as the first option and RSVP agrees with it; WC signifies that the speech-recognition software returned the wrong answer as the first option and RSVP returned the correct answer; CW signifies that the speech-recognition software returned the correct answer as the first option and RSVP returned the wrong answer; and WW signifies that the speech-recognition software returned the wrong answer as the first option and RSVP also returned the wrong answer. All experiments were performed in quiet environments; in each, the speaker tried again if neither the speech recognition nor RSVP was correct (see Table 2).

Table 1. Individuals used in our experiments.

          Native Speaker      Non-Native Speaker
          Adult    Child      Adult    Child
Female    0        1          4        1
Male      3        2          3        0

Table 3. Experimental results for translation.

                              # questions  CC   WC   CW  WW  Base Translator Accuracy  RSVP Accuracy
Google as base translator     428          112  211  6   99  27.5%                     75.6%
Microsoft as base translator  428          116  207  11  94  29.6%                     75.6%
Google as base translator     52           21   15   0   16  40%                       69%
Google as base translator     114          44   49   1   20  38%                       81.6%
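The accuracy columns follow directly from the four outcome counts: the base system is correct in the CC and CW cases (its first option was right), while RSVP is correct in CC and WC. A sketch reproducing the first row, up to rounding:

```python
def accuracies(cc, wc, cw, ww):
    """Base-system and RSVP accuracy from the four outcome counts.
    The base system's first option is correct in CC and CW;
    RSVP's output is correct in CC and WC."""
    total = cc + wc + cw + ww
    return (cc + cw) / total, (cc + wc) / total

# First row of the translation table: Google as base translator.
base, rsvp = accuracies(112, 211, 6, 99)  # roughly 27.6% and 75.5%
```

The same counts give the error-reduction figures quoted for speech correction: base errors are WC + WW, while RSVP errors are CW + WW.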


Experiment 1. Non-native speaker, male. Here, the speaker chose only easy-to-read questions from the 300-question Microsoft test set; the following are examples where RSVP corrected Google's errors:

Original question: How many types of birds are found in North America? Google speech-recognition output: "How many pints of birds are formed in North America?" "How many pints of birds are found in North America?" "How many pints of birds performed in North America?" RSVP output: How many types of birds are found in North America?

Original question: How does the atom bomb work? Google speech-recognition output: "call does the atom bomb work?" "All does the atom bomb work?" "aha does the atom bomb work?" RSVP output: How does the atom bomb work?

Original question: How holidays are celebrated around the world? Google speech-recognition output: "call holidays are celebrated around in the wall?" "call holidays are celebrated around in the world?" "how all holidays are celebrated around in the wall?" RSVP output: how holidays are celebrated around in the world?

Original question: Are there any known aliens? Google speech-recognition output: "are there any loans deviance?" "are there any loans aliens?" "are there any known deviance?" RSVP output: Are there any known aliens?

Original question: What does a frog eat? Google speech-recognition output: "what is front seat?" "what is the frogs eat?" "what does the front seat?" RSVP output: What does frogs eat?

Experiment 2. Native speaker, male;
Experiment 3. Native speaker, male;

Experiment 4. Non-native speaker, male;

Experiment 5. Non-native speaker, female;

Experiment 6. Non-native speaker, female;

Experiment 7. Three native English-speaking children, 100 questions each: eight years old, female; nine years old, male; and 11 years old, male. In principle, we prefer independent tests with one individual responsible for the complete set of 300 questions. However, we were only able to get each of the children to do 100 questions, skipping the difficult ones. The result is similar to that of adult native English speakers;

Experiment 8. Native English speaker, male;

Experiment 9. Non-native English speaker, male;

Experiment 10. Non-native English speaker, female, 11 years old, in Canada to attend summer camp to learn English; her English was rudimentary, and consequently she was able to read only 57 questions out of 300;

Experiment 11. Non-native English speaker, female; and

Experiment 12. Non-native English speaker, female.

In our experiments (Table 2), the non-native speakers and the children selected relatively easy-to-read questions (without, say, difficult-to-pronounce names) from T to do the tests. The ratio of improvement was better for the non-native speakers, reducing the number of errors (WW column) by 30% on average for experiments 1, 4, 5, 6, 9, 10, 11, and 12. For native speakers, RSVP also delivered a clear advantage, reducing the number of errors (WW column) by 16% on average for experiments 2, 3, 7, and 8. Such an advantage would be amplified in a noisy real-life environment. Allowing the speaker to repeat the question would increase the success rate, as in the following example (with Google): RSVP generated "How many toes does Mary Monroe have?" for the first query and "How many titles does Marilyn Monroe have?" for the second query. Combining the two questions, RSVP generated the correct intended question "How many toes does Marilyn Monroe have?"

    Translation examples.

    Google translation: Fly from Toronto to Beijing long?

    Our translation: How long does it take to fly from Toronto to Beijing?

    Google translation: People who have a few bones?

    Our translation: How many bones do people have?

    Google translation: Taiwans population size?

    Our translation: What is the population size of Taiwan?

    Google translation: When did the dinosaurs extinct?

    Our translation: When did the dinosaurs become extinct?

Table 2. Experimental results for speech correction.

Experiment  Total # questions  CC   WC  CW  WW
1           164                105  39  5   15
2           300                219  25  6   50
3           300                222  15  5   58
4           257                141  41  7   68
5           181                100  26  4   51
6           214                125  29  10  50
7           206                145  19  8   34
8           298                180  12  4   102
9           131                77   14  0   40
10          57                 28   4   1   24
11          63                 35   9   1   18
12          107                62   9   2   34


    Translation

    To further justify the methodologyproposed here, we extend the ap-proach to translation in the QA do-main for cross-language search. Here,

    we use Chinese-English cross-lan-guage search as an example, thoughthe methodology works for other lan-

    guages, too.A Chinese-speaking person can

    perform a cross-language search ofthe English Internet in two ways:

    Translate it all.Translate the wholeEnglish Internet, including all QApairs in the English QA communityand English (Wikipidea) databases,into Chinese; or

    Translate a question. Translate a Chinese question into English, find the answer in English, then translate the answer back to Chinese.

    General-purpose translators today perform so poorly that the first option is out of the question. The RSVP methodology enables the second option, which involves two translations: the Chinese question to English, then the English answer back to Chinese. Since the QA answers are usually simple and read by humans, and the database relations can be manually translated, a general-purpose translator is sufficient and sometimes not even needed. The key to this approach is translating Chinese questions into English. We implemented the translation system and cross-language (Chinese-English) search as part of the RSVP QA system through the following steps:

    Translate a Chinese question into English through a general-purpose translator;

    Apply the (modified) correction procedure described here;

    Perform English QA search; and

    Translate the result back into Chinese through a general-purpose translator.
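The four steps can be sketched as a simple pipeline. The three callables are placeholders for a general-purpose translator, RSVP's correction procedure, and the English QA engine; none of them names a real API, and the demo merely tags strings instead of actually translating or searching.

```python
def cross_language_qa(question_zh, translate, correct_query, qa_search):
    """Four-step Chinese-English cross-language QA pipeline sketch.
    `translate`, `correct_query`, and `qa_search` are caller-supplied
    placeholders (no real MT or QA API is assumed)."""
    # 1. Translate the Chinese question into (possibly broken) English.
    raw_en = translate(question_zh, src="zh", dst="en")
    # 2. Repair it into a meaningful English question.
    question_en = correct_query(raw_en)
    # 3. Answer the corrected question with the English QA engine.
    answer_en = qa_search(question_en)
    # 4. Translate the (usually simple) answer back into Chinese.
    return translate(answer_en, src="en", dst="zh")

# Toy demo: tag strings instead of really translating or searching.
demo = cross_language_qa(
    "人有多少块骨头?",
    translate=lambda t, src, dst: f"[{src}->{dst}] {t}",
    correct_query=lambda t: t,
    qa_search=lambda q: "206 bones",
)
print(demo)  # -> [en->zh] 206 bones
```

Keeping the three stages as separate functions reflects the argument in the text: only step 2, question correction, needs RSVP; the surrounding translation steps can remain a stock general-purpose translator.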

    Table 3 outlines experiments with our translation system, using the notation outlined earlier: CC, WC, CW, and WW. The first three used data collected as we developed RSVP; the fourth used an independent dataset of 114 questions on a range of topics (see the figure here).

    Conclusion

    This work would be more effective if it were integrated into speech-recognition software so more voice information could be used. However, it targets dynamic special domains that are so numerous that training them separately would be prohibitive.

    In addition to special-domain translation, the RSVP methodology can be used to correct the grammatical errors and spelling mistakes in a normal QA text search, as well as to create an automatic writing assistant for a highly structured domain.

    Moreover, tuning improves all systems; for example, if we ask "What is the distance between Toronto and Waterloo bla," then we know the extra "bla" should be removed, as inferred from system knowledge, requiring zero bits to encode such a deletion. The theory allows us to add inferable rules at no additional cost.
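A minimal sketch of such a zero-cost rule follows, assuming a hypothetical filler-word list and a uniform log2(vocabulary-size) price for all other deletions; neither is taken from the paper's actual encoding, which is based on information distance.

```python
import math

# Hypothetical list of words the system knowledge says can be
# deleted "for free"; an assumption, not taken from the paper.
INFERABLE_FILLER = {"bla", "uh", "um"}

def deletion_cost(word, vocab_size=100_000):
    """Bits needed to encode deleting `word` when mapping a raw query
    to a candidate intended query. Inferable deletions cost zero
    bits; any other deletion must be spelled out, priced here at a
    uniform log2(vocab_size) bits (a simplification)."""
    if word.lower() in INFERABLE_FILLER:
        return 0.0
    return math.log2(vocab_size)

query = "What is the distance between Toronto and Waterloo bla"
print(deletion_cost(query.split()[-1]))  # trailing "bla" -> 0.0
```

Because dropping the trailing "bla" adds nothing to the encoding length, the candidate query without it is at least as close, in information distance, as any candidate that keeps it.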

    Acknowledgments

    We thank Li Deng of Microsoft for his advice on speech-recognition software and Leonid Levin of Boston University for discussions and suggestions on alternative formulations. We are grateful to the volunteers who participated in our experiments. We thank the referees and Nicole Keshav for helping us improve the article. We also thank Alan Baklor of Answer.com for his help. This work has been partially supported by Canada's IDRC Research Chair in Information Technology program, NSERC Discovery Grant OGP0046506, the Canada Research Chair program, a CFI Infrastructure grant, an NSERC Collaborative grant, Ontario's Premier's Discovery Award, and the Killam Prize.

    References

    1. Baker, J.M., Deng, L., Glass, J., Khudanpur, S., Lee, C.H., Morgan, N., and O'Shaughnessy, D. Developments and directions in speech recognition and understanding, Part 1. IEEE Signal Processing Magazine 26, 3 (May 2009), 75-80.

    2. Bennett, C.H., Gács, P., Li, M., Vitányi, P., and Zurek, W. Information distance. IEEE Transactions on Information Theory 44, 4 (July 1998), 1407-1423.

    3. Bennett, C.H., Li, M., and Ma, B. Chain letters and evolutionary histories. Scientific American 288, 6 (June 2003), 76-81.

    4. Chen, X., Francia, B., Li, M., McKinnon, B., and Seker, A. Shared information and program plagiarism detection. IEEE Transactions on Information Theory 50, 7 (July 2004), 1545-1550.

    5. Cilibrasi, R., Vitányi, P., and de Wolf, R. Algorithmic clustering of music based on string compression. Computer Music Journal 28, 4 (Winter 2004), 49-67.

    6. Cilibrasi, R. and Vitányi, P. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19, 3 (Mar. 2007), 370-383.

    7. Cuturi, M. and Vert, J.P. The context-tree kernel for strings. Neural Networks 18, 4 (Oct. 2005), 1111-1123.

    8. Fagin, R. and Stockmeyer, L. Relaxing the triangle inequality in pattern matching. International Journal of Computer Vision 28, 3 (1998), 219-231.

    9. Hoffart, J., Suchanek, F.M., Berberich, K., and Weikum, G. YAGO2: A spatially and temporally enhanced knowledgebase from Wikipedia. Artificial Intelligence 194 (Jan. 2013), 28-61.

    10. Keogh, E., Lonardi, S., and Ratanamahatana, C.A. Towards parameter-free data mining. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 2004, 206-215.

    11. Li, M. Information distance and its applications. International Journal on the Foundations of Computer Science 18, 4 (Aug. 2007), 669-681.

    12. Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., and Zhang, H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 2 (Feb. 2001), 149-154.

    13. Li, M., Chen, X., Li, X., Ma, B., and Vitányi, P. The similarity metric. IEEE Transactions on Information Theory 50, 12 (Dec. 2004), 3250-3264.

    14. Li, M. and Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications, Third Edition. Springer-Verlag, New York, 2008.

    15. Lieberman, H., Faaborg, A., Daher, W., and Espinosa, J. How to wreck a nice beach you sing calm incense. In Proceedings of the 10th International Conference on Intelligent User Interfaces (2005), 278-280.

    16. Lopes, L.R. A software agent for detecting and correcting speech recognition errors using a knowledge base. SATNAC, 2008.

    17. Nykter, M., Price, N.D., Larjo, A., Aho, T., Kauffman, S.A., Yli-Harja, O., and Shmulevich, I. Critical networks exhibit maximal information diversity in structure-dynamics relationships. Physical Review Letters 100, 5 (Feb. 2008), 058702.

    18. Nykter, M., Price, N.D., Aldana, M., Ramsey, S.A., Kauffman, S.A., Hood, L.E., Yli-Harja, O., and Shmulevich, I. Gene expression dynamics in the macrophage exhibit criticality. Proceedings of the National Academy of Sciences 105, 6 (Feb. 2008), 1897-1900.

    19. Pao, H.K. and Case, J. Computing entropy for ortholog detection. In Proceedings of the International Conference on Computational Intelligence (Istanbul, Turkey, Dec. 17-19, 2004).

    20. Rosenfeld, R. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE 88, 8 (Aug. 2000), 1270-1278.

    21. Sarma, A. and Palmer, D.D. Context-based speech recognition error detection and correction. In Proceedings of the Human Language Technology Conference (Boston, May 2-7). Association for Computational Linguistics, Stroudsburg, PA, 2004, 85-88.

    22. Veltkamp, R.C. Shape matching: Similarity measures and algorithms. In Proceedings of the International Conference on Shape Modeling and Applications (Genoa, Italy, 2001), 188-197.

    23. Zhang, X., Hao, Y., Zhu, X.Y., and Li, M. New information measure and its application in question-answering system. Journal of Computer Science and Technology 23, 4 (July 2008), 557-572.

    24. Zhang, X., Hao, Y., Zhu, X., and Li, M. Information distance from a question to an answer. In Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery in Data Mining (San Jose, CA, Aug. 12-15). ACM Press, New York, 2007, 874-883.

    Yang Tang ([email protected]) is a research associate in the David R. Cheriton School of Computer Science at the University of Waterloo, Waterloo, Ontario.

    Di Wang ([email protected]) is a graduate student in the David R. Cheriton School of Computer Science at the University of Waterloo, Waterloo, Ontario.

    Jing Bai ([email protected]) is a researcher at Microsoft Corporation, Silicon Valley campus, Sunnyvale, CA.

    Xiaoyan Zhu ([email protected]) is a professor in the Tsinghua National Laboratory for Information Science and Technology and Department of Computer Science and Technology and director of the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University, Beijing.

    Ming Li ([email protected]) is Canada Research Chair in Bioinformatics and a University Professor in the David R. Cheriton School of Computer Science at the University of Waterloo, Waterloo, Ontario.

    © 2013 ACM 0001-0782/13/07