70 COMMUNICATIONS OF THE ACM | JULY 2013 | VOL. 56 | NO. 7
contributed articles
DOI:10.1145/2483852.2483869

Information Distance Between What I Said and What It Heard

The RSVP voice-recognition search engine improves speech recognition and translation accuracy in question answering.

BY YANG TANG, DI WANG, JING BAI, XIAOYAN ZHU, AND MING LI

VOICE INPUT IS a major requirement for practical question-answering (QA) systems designed for smartphones. Speech-recognition technologies are not fully practical, however, due to fundamental problems (such as noisy environments, speaker diversity, and errors in speech). Here, we define the information distance between a speech-recognition result and a meaningful query from which we can reconstruct the intended query, implementing this framework in our RSVP system.

In 12 test cases covering male, female, child, adult, native, and non-native English speakers, each with 57 to 300 questions from an independent test set of 300 questions, RSVP on average reduced the number of errors by 16% for native speakers and by 30% for non-native speakers over the best-known speech-recognition software. The idea was then extended to translation in the QA domain.

In our project, which is supported by Canada's International Development Research Centre (http://www.idrc.ca/), we built a voice-enabled cross-language QA search engine for cellphone users in the developing world. Using voice input, a QA system would be a convenient tool for people who do not write, for people with impaired vision, and for children who might wish their Talking Tom or R2-D2 really could talk.

The quality of today's speech-recognition technologies, exemplified by systems from Google, Microsoft, and Nuance, does not fully meet such needs for several reasons:

Noisy environments in common audio situations;1

Speech variations, as in, say, adults vs. children, native speakers vs. non-native speakers, and female vs. male, especially when individual voice-input training is not possible, as in our case; and

Incorrect and incomplete sentences; even customized speech-recognition systems would fail due to coughing, breaks, corrections, and the inability to distinguish between, say, "sailfish" and "sale fish."

key insights

Focusing on an infinite but highly structured domain (such as QA), we significantly improve general-purpose speech-recognition results and general-purpose translation results.

Assembling a large amount of Internet data is key to helping us achieve these goals; in the highly structured QA domain, we collected millions of human-asked questions covering 99% of question types.

RSVP development is guided by a theory involving information distance.

Speech-recognition systems can be trained for a fixed command set of up to 10,000 items, a paradigm that
does not work for general speech recognition. We consider a new paradigm: speech recognition limited to the QA domain, covering an unlimited number of questions; hence it cannot be trained as a fixed-command-set domain. However, our QA domain is highly structured and reflects clear patterns.
We use information on the Internet in the QA domain to find all possible question patterns, then use them to correct the queries that are partially recognized by speech-recognition software. We collected more than 35 million questions from the Internet, aiming to use them to infer templates or patterns to reconstruct the original intended question from the speech-recognition input. Despite this, we must still address several questions:
How do we know if an input question from the speech-recognition system is indeed the original user's question?;

How do we know if a question in our database is the user's intended question?;

Should we trust the input or the database?; and

Often neither the database nor the input is exactly right, so can we reconstruct the original question?

We provide a mathematical framework to address these questions and implement the related RSVP system. Our experiments show RSVP significantly improves current speech-recognition systems in the QA domain.
Related Work
Speech recognition1 has made significant progress over the past 30 years since the introduction of statistical methods and hidden Markov models. Many effective algorithms have been developed, including the EM algorithm, the Baum-Welch algorithm, Viterbi N-best search, and N-gram language models trained on large data corpora. However, as explored by Baker et al.,1 automatic speech recognition is still an unsolved problem.
Unlike traditional speech-recognition research, we propose a different paradigm in the QA domain. In it, we are able to collect a very large pure-text corpus (no voice); this differs from the fixed-command-set domain, where it is possible to train up to, say, 10,000 commands. The QA domain is unbounded, and existing QA websites hold more than 100 million questions, yet with very low coverage of all possible questions. These texts can be clustered into patterns. Here, we demonstrate that these patterns have 99% coverage of all possible question types, suggesting we can use them to improve speech-recognition software in this domain.

Previous research suggested that context information,21 knowledge bases,16,20 and conceptual relationships15 all can help address this.
[Illustration: "Who is the mayor?"]
Information Distance

Here, we develop the mathematical framework on which we designed our system. To define Kolmogorov complexity (invented in the 1960s), we start by fixing a universal Turing machine U. The Kolmogorov complexity of a binary string x, given another binary string y, KU(x|y), is the length of the shortest (prefix-free) program for U that outputs x with input y. Since it can be shown that for a different universal Turing machine U the metric differs by only a constant, we write K(x|y) instead of KU(x|y). We write K(x|ε), where ε is the empty string, as K(x). We call a string x random if K(x) ≥ |x|. See Li and Vitányi14 for more on Kolmogorov complexity and its rich applications.

Note K(x) defines the amount of information in x. What would be a good departure point for defining an information distance between two objects? In the 1990s, Bennett et al.2 studied the energy cost of conversion between two strings, x and y. John von Neumann hypothesized that performing 1 bit of information processing costs 1 kT of energy, where k is Boltzmann's constant and T is the room temperature. In the 1960s, observing that reversible computations could be done at no cost, Rolf Landauer revised von Neumann's proposal to hold only for irreversible computations. Starting from this von Neumann-Landauer principle, Bennett et al.2 proposed using the minimum number of bits needed to convert x to y and vice versa to define their distance. Formally, with respect to a universal Turing machine U, the cost of conversion between x and y is defined as

E(x,y) = min{|p| : U(x,p) = y, U(y,p) = x}.   (1)

It is clear that E(x,y) ≤ K(x|y) + K(y|x). Bennett et al.2 obtained the following optimal result, modulo an additive log(|x| + |y|) term:

Theorem 1. E(x,y) = max{K(x|y), K(y|x)}.

Thus, we define the information distance between two sequences x and y as

Dmax(x,y) = max{K(x|y), K(y|x)}.

This distance Dmax was shown to satisfy the basic distance requirements (such as positivity, symmetricity, and the triangle inequality). Furthermore, Dmax is universal in the sense that it always minorizes any other reasonable computable distance metric.

This concept, and its normalized versions, were applied to whole-genome phylogeny,12 chain-letter-evolution history,3 plagiarism detection,4 other phylogeny studies,13 music classification,5 and parameter-free data mining,10 and has been followed by many other applications (for topics mentioned here that do not have references, see Li and Vitányi14), including protein-structure comparison, heart-rhythm data analysis, QA systems, clustering, multiword-expression linguistic analysis, software evolution and engineering, software metrics and obfuscation, webpage authorship, topic, and domain identification, phylogenetic reconstruction, SVM kernels for string classification, ortholog detection,19 analyzing worms and network traffic, image similarity, Internet knowledge discovery,6 multi-document summarization, network structure and dynamic behavior,17 and gene-expression dynamics in macrophages.18

Despite its useful properties and applications, the max distance Dmax(x,y) involves several problems when only partial matches are considered,8,22 where the triangle inequality fails to hold and irrelevant information must be removed. Thus, Li et al.11 introduced a complementary information-distance metric to resolve these problems. In Equation 1 we determine the smallest number of bits that must be used to reversibly convert between x and y. To remove the irrelevant information from x or y, we thus define, with respect to a universal Turing machine U, the cost of conversion between x and y as

Emin(x,y) = min{|p| : U(x,p,r) = y, U(y,p,q) = x, |p| + |q| + |r| ≤ E(x,y)}.   (2)

This definition separates r from x and q from y. Modulo an O(log(|x| + |y|)) additive term, the following theorem was proved in Li11:

Theorem 2. Emin(x,y) = min{K(x|y), K(y|x)}.

We can thus define Dmin(x,y) = Emin(x,y) as a complementary information-distance metric that disregards irrelevant information. Dmin is obviously symmetric but does not satisfy the triangle inequality. Note that Dmin was used in the QUANTA QA system to deal with concepts that are more popular than others.23,24
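Kolmogorov complexity is uncomputable, so the applications above approximate it with real-world compressors (the basis of the normalized versions cited). A minimal sketch, using zlib's compressed length C(·) and the standard approximation K(x|y) ≈ C(yx) − C(y); the example strings are illustrative:

```python
import zlib

def C(s: str) -> int:
    """Approximate Kolmogorov complexity by zlib-compressed length."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def d_max(x: str, y: str) -> int:
    """Compression approximation of Dmax(x,y) = max{K(x|y), K(y|x)},
    using K(x|y) ~ C(y + x) - C(y)."""
    return max(C(y + x) - C(y), C(x + y) - C(x))

q  = "who is the mayor of waterloo ontario"
q1 = "hole is the mayor of waterloo ontario"   # close to q
q2 = "how does the atom bomb work exactly now" # unrelated to q

assert d_max(q, q1) < d_max(q, q2)
```

Because the shared substring of q and q1 compresses to a short back-reference, their approximate distance is much smaller than that between unrelated questions.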
Min-Max Distance
Now we formulate our problem in the frame of information distance. Given a question database Q and k input questions from a speech-recognition system, I = {q1, ..., qk} (k ≤ 3 for the Google speech-recognition server in our experiments, and k = 1 in our translation application), the goal is to compute the user's intended question q. It could be one of the qi; it could be a combination of all k of them; and it could also be one of the questions in Q that is close to some parts of the qi.

We wish to find the most plausible question q such that q fits one of the question patterns in Q and q is close to I. We assume Q contains almost all question patterns; later, we provide an experimental justification for this claim.
We can thus formulate our problem as: Given Q and I, find the q that minimizes the sum of the distances from Q to q and from q to I:

D(Q, q) + D(q, I).

Here, Q is a huge database of 35 million questions asked by users. We assume q is similar to one of them; for example, a QA user might ask "Who is the mayor of Waterloo, Ontario?," but Q might include such questions as "Who is the mayor of Toronto, Ontario?" and "Who is the mayor of Washington, D.C.?" I sometimes contains questions like "Hole is the mayor?" and "Who mayor off Waterloo" from the speech-recognition software. Since Q is so large, the Dmax measure does not make sense here, as most of the information in Q is irrelevant; it is natural to use Dmin(q,Q) instead. For the distance between q and I, we use Dmax(q,I). Given I and Q, we wish to find the q that minimizes the function
δ·Dmin(q,Q) + Dmax(q,I),   (3)
where Dmin measures the information distance between q and Q with irrelevant information removed, and Dmax is the information distance between q and I. We know

Dmin(x,y) = min{K(x|y), K(y|x)},
Dmax(x,y) = max{K(x|y), K(y|x)}.

Thus, Dmin(q,Q) = K(q|Q), because Q is very large and q is a single question. Note that δ is a coefficient that determines how much weight we wish to give to a correct template or pattern in Q. Equation 3 thus becomes

δ·K(q|Q) + max{K(q|I), K(I|q)}.   (4)
Observations: We need δ > 1, so that q = I does not minimize Equation 4. If δ is too large, then q = ε might minimize Equation 4. There is a trade-off: Sometimes a less-popular pattern (taking more bits in the Dmin term) might fit I better (taking fewer bits in the Dmax term), and a more-popular pattern (taking fewer bits in the Dmin term) might miss one or two keywords in I, taking more bits to encode in the Dmax term. The value of δ is optimized for this trade-off.
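Since K is uncomputable, a concrete system must substitute computable costs. The following toy sketch is not the authors' implementation; the weight value and the word-level edit distances standing in for the encoding lengths are illustrative assumptions, but it shows how the δ trade-off in Equation 4 selects a corrected question:

```python
def word_edit(a, b):
    """Word-level Levenshtein distance, a crude stand-in for K(.|.)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): d[i][0] = i
    for j in range(n + 1): d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

DELTA = 1.5  # delta > 1, per the observations above (illustrative value)

def score(q, patterns, inputs):
    # delta * (cost of fitting q to its best pattern) + (cost of fitting q to I)
    pattern_cost = min(word_edit(q.split(), p.split()) for p in patterns)
    input_cost = max(word_edit(q.split(), i.split()) for i in inputs)
    return DELTA * pattern_cost + input_cost

patterns = ["who is the mayor of <city>"]           # hypothetical template
inputs = ["hole is the mayor of waterloo",          # recognition hypotheses I
          "who mayor off waterloo"]
candidates = ["who is the mayor of waterloo",
              "hole is the mayor of waterloo",
              ""]                                   # the empty question
best = min(candidates, key=lambda q: score(q, patterns, inputs))
```

Here the pattern-fitting candidate wins: the literal input has a higher pattern cost, and the empty question pays the full cost of both terms.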
Encoding Issues
To minimize Equation 4, we solve three problems:

Encode q using Q in the first term; it is a problem to encode an item with respect to a big set;

Encode q using I, or encode I using q, and take whichever is larger in the second term; and

Find all possible candidate questions and the q that minimizes Equation 4.
We see that Q is very large and contains different types of questions. For each such type, we extract one or more question templates. In this way, Q can be viewed as a set of templates, with each template, denoted as p, covering a subset of questions from Q. When encoding q, we need not encode q from Q directly. Instead, we encode q with respect to the patterns or templates of Q; for example, if a pattern p appears N times in Q, then we use log2(|Q|/N) bits to encode the index for this pattern. Given pattern p, we encode q with p by encoding their word mismatches. There is a trade-off between the encoding of p and the encoding of q given p. A common pattern may be encoded with a few bits but also may require more bits to encode a specific question using the pattern; for example, the template "who is the mayor of City Name" requires more bits to encode than the template "who is the mayor of Noun" because the former is a smaller class than the latter. However, the first template requires fewer bits to generate the question "who is the mayor of Waterloo," since it requires fewer bits to encode "Waterloo" from the class City Name than from the class Noun.
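The bit counts in this trade-off are simple logarithms. A worked sketch with made-up frequencies (all counts below are illustrative assumptions, not measurements from Q):

```python
import math

Q_SIZE = 35_000_000                      # size of the question database Q
# Hypothetical counts: the more specific "City Name" template occurs less
# often in Q, but its slot class has a smaller vocabulary than "Noun."
N_CITY, N_NOUN = 40_000, 900_000         # template occurrences in Q
CITY_VOCAB, NOUN_VOCAB = 30_000, 80_000  # slot-class sizes

bits = math.log2

# Naming the rarer, more specific template costs more bits ...
assert bits(Q_SIZE / N_CITY) > bits(Q_SIZE / N_NOUN)
# ... but filling its smaller slot class costs fewer bits.
assert bits(CITY_VOCAB) < bits(NOUN_VOCAB)

total_city = bits(Q_SIZE / N_CITY) + bits(CITY_VOCAB)
total_noun = bits(Q_SIZE / N_NOUN) + bits(NOUN_VOCAB)
```

Which template wins overall depends on the actual counts; Equation 4 simply charges each candidate the sum of both terms.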
The patterns can be extracted by pre-processing or dynamically, according to the input. In practice, we extract patterns only from the questions relevant to I, organizing these questions hierarchically: similar questions are mapped to a cluster, and similar clusters are mapped to a bigger cluster. We extract one pattern from each cluster using a multiple-alignment algorithm. This pattern should be as specific as possible while at the same time covering all questions in the cluster. Note that the higher the cluster in the hierarchical structure, the more general the pattern. Our hierarchical clustering algorithm thus assures we can extract all possible patterns from the relevant questions. We make use of numerous semantic and syntactic information sources during the process, including a POS tagger, named-entity recognition, WordNet, and Wikipedia. For example, given a cluster of three questions:

"Who is the mayor of Toronto?";

"Who is the president of the United States?"; and

"Who is a senator from New York?"

we could extract the pattern "Who is the Leader of Location?" "Mayor," "president," and "senator" are all mapped to the Leader class, while "Toronto," "United States," and "New York" all belong to the Location class.
If we treat pattern p as a sentence, the problem of item-to-set encoding becomes item-to-item encoding, as in the computation of K(q|I) and K(I|q). To convert one sentence from another, we need encode only the word mismatches and the missing words. The best alignment between two sentences is found through a standard dynamic-programming algorithm. We encode a missing word by the negative logarithm of its probability to appear at the given location and encode the mismatches by
calculating their semantic and morphological similarities. It requires fewer bits to encode between synonyms than between antonyms.

To select the candidate question q for Equation 4, we use two strategies:

Offline. We cluster the questions in Q and generate patterns offline, finding the most likely pattern, then generate a q that is close to the input and to one of the patterns; and

Online. We consider only the question candidates relevant to the input that could be matched by at least one of our templates, generated from the few questions in Q that share some keywords with the input.

Finally, we choose the best q that minimizes Equation 4. Furthermore, we apply a bigram language model to filter questions with low trustworthiness. The language model is trained on our background question set Q. The value of δ is trained as a part of our experiments; in our system it is a function of the lengths of the input questions.

We have thus implemented RSVP, which, given speech-recognition input {q1, q2, q3}, finds the q such that Equation 4 is minimized.

Implementation Details

The following are some RSVP implementation details:

Step 1. Analyze input:

Split the questions into words with the Stanford POS Tagger; at the same time, named entities are extracted from the input questions using Stanford NER, LingPipe NER, and YAGO; and

Find the best alignments among the input questions through dynamic programming. Words or named entities with similar pronunciations are mapped together; for example, given the three questions "whole is the mayor of Waterloo," "hole is the mayor of Water," and "whole was the mayor of Water," the best word alignment would look like this:

Whole is  the mayor of Waterloo
Hole  is  the mayor of Water
Whole was the mayor of Water

Step 2. Improve input questions:

Build a question based on the word-alignment results from Step 1; for each aligned word block, we choose one word to appear in the result;

We assume a well-formed question should contain a wh-word (what, who, which, whose, whom, when, where, why, or how) or an auxiliary verb (be, have, do, will (would), shall (should), may (might), must, need, dare, or ought). If the inputs do not contain any such word, we add proper words to the question candidates; and

Since some correct words may not appear in the input, we further expand the question candidates with homonym dictionaries and metaphone dictionaries.

Step 3. Analyze relevant database patterns:

Find relevant database questions, sorting them based on their semantic and syntactic similarity to the improved input questions from Step 2;

If a question is almost the same as one of the input questions, we return that input question directly, and no further steps are performed;

The database questions involve many forms and patterns. We group similar questions together through a hierarchical clustering algorithm. The distance between two clusters is calculated based on the syntactic similarity between questions. The algorithm stops when the minimum distance between clusters reaches a predefined threshold;

When the relevant questions are grouped into clusters, we are able to extract patterns from each cluster. Following the algorithm outlined in Step 1, we align the questions in each group, using their semantic similarities to encode the word distance. A group of questions is thus converted into a single list of word blocks, with each block containing several alternative words from different questions; for example, given the questions "Who is the mayor of New York," "Who is the president of United States," and "Which person is the leader of Toronto," we obtain a list of word blocks after alignment:

{who, who, which person} {is} {the} {mayor, leader, president} of {New York, United States, Toronto}; and

For each aligned word block, we further extract tags that best describe the slot; here, YAGO9 is used to describe the meaning of each word or phrase. We extract the several most-common facts as the description of each word block. We then obtain one or several semantic patterns composed of words and facts from YAGO.

Step 4. Generate the candidate questions:

Map the original input questions into the patterns we extracted from the database and replace the words in the patterns with the words from the input. Many candidate questions can be generated by considering the various combinations of word replacements; and

To reduce complexity, we train a bigram language model from our question set, removing candidate questions with low probability.

Step 5. Rank candidate questions using information distance:

Calculate the distances between the candidate questions and the input questions, K(q|I) and K(I|q). We align the candidate and input questions and encode the word mismatches and missing words, encoding a missing word through the negative logarithm of its probability to appear at the said
locations and calculating word mismatches through their semantic, morphological, and metaphone similarities;

Calculate the distance between the candidate questions and the patterns, K(q|p). A method similar to the previous step is used to calculate the distances between questions and patterns; and

Rank all the candidates using Equation 4.

Step 6. Return the candidate with the minimum information-distance score as the final result. To improve speed, the last three items of Step 3 may be performed offline on the complete database Q.
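Steps 1 and 3 both rest on word-level alignment. A minimal sketch of merging recognition hypotheses into aligned word blocks, using a pairwise Levenshtein backtrace against the first hypothesis; the authors' multiple-alignment algorithm, pronunciation similarity, and multiword slots are omitted here:

```python
def align(ref, hyp):
    """Align hyp words to ref positions via a Levenshtein backtrace."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): d[i][0] = i
    for j in range(n + 1): d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))
    out, i, j = [None] * m, m, n
    while i > 0 and j > 0:
        if d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            out[i - 1] = hyp[j - 1]; i -= 1; j -= 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1      # ref word missing from hyp
        else:
            j -= 1      # extra hyp word, ignored
    return out

def word_blocks(questions):
    """Merge several recognition hypotheses into per-position word blocks."""
    ref = questions[0].split()
    blocks = [[w] for w in ref]
    for q in questions[1:]:
        for block, w in zip(blocks, align(ref, q.split())):
            if w is not None and w not in block:
                block.append(w)
    return blocks

qs = ["whole is the mayor of waterloo",
      "hole is the mayor of water",
      "whole was the mayor of water"]
blocks = word_blocks(qs)
# blocks[0] == ["whole", "hole"]; blocks[1] == ["is", "was"]
```

Each block then becomes a slot from which Step 2 picks one word (or a corrected homophone) per position.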
Completeness of the Database Q
We tested the hypothesis that Q contains almost all common question types. The test set T contained 300 questions, selected (with the criteria of no more than 11 words or 65 letters, one question per sentence, and no non-English letters) from an independent Microsoft QA set at http://research.microsoft.com/en-us/downloads/88c0021c-328a-4148-a158-a42d7331c6cf. We found that all but three have corresponding patterns in Q; only "Why is some sand white, some brown, and some black?," "Do flying squirrels fly or do they just glide?," and "Was there ever a movement to abolish the electoral college?" lacked strictly similar patterns. We will provide the data set T upon request.
Experiments
Our experiments aimed to test RSVP's ability to correct speech-recognition errors in the QA domain, focusing on non-native speakers, as there are three non-native English speakers for every native English speaker in the world. We then further test and justify the proposed methodology by extending it to translation in the QA domain.

Experiment setup. We initially (in 2011) used the Nuance speech-recognition server and later switched to Google speech recognition (http://google.com) because the Google server has no daily quota and responds more quickly. The RSVP system is implemented in a client-server architecture. The experiments were performed at a computer terminal with a microphone: the experimenter would read a question, Google speech recognition would return three options, and RSVP would use the three questions as input to compute the most likely question.

Dataset. We use the set T described in the previous section; T contains 300 questions, was chosen independently, and T ∩ Q = ∅. Not all questions in T were used by each speaker in the experiments; non-native speakers and children skipped sentences that contain difficult-to-pronounce words, and less-proficient English speakers tended to skip more questions.

Time complexity. On a server with four cores (2.8GHz per core) and 4GB of memory, RSVP typically takes approximately 500ms to correct one question; that is, the speaker reads a question into a microphone, Google voice recognition returns three questions, and RSVP uses those questions as input, taking approximately half a second to output one final question.
Human speaker volunteers. Such experiments are complex and time-consuming. We tried our best to remove individual speaker variance by having different people perform the experiments independently. We recruited 14 human volunteers, including native and non-native English speakers, adults and children, and females and males (see Table 1), during the period 2011 to 2012.

We performed 12 sets of experiments involving 14 different speakers, all using the same test set T or a subset of T. Due to children's naturally short attention spans, the three native English-speaking children (two males, one female) completed one set of experiments (experiment 7), each responsible for 100 questions. A non-native-speaking female child, age 12, performed the test (experiment 10) independently but was able to finish only 57 questions.

In the following paragraphs, "CC" signifies that the speech-recognition software (from Google) returned the correct answer as the first option and RSVP agreed with it; "WC" signifies that the speech-recognition software returned the wrong answer as the first option and RSVP returned the correct answer; "CW" signifies that the speech-recognition software returned the correct answer as the first option and RSVP returned the wrong answer; and "WW" signifies that both the speech-recognition software and RSVP returned wrong answers. All experiments were performed in quiet environments; in each, the speaker tried again if neither the speech recognition nor RSVP was correct (see Table 2).
Table 1. Individuals used in our experiments.

          Native Speaker      Non-Native Speaker
          Adult    Child      Adult    Child
Female    0        1          4        1
Male      3        2          3        0
Table 3. Experimental results for translation.

Base translator                # questions   CC    WC    CW   WW   Base accuracy   RSVP accuracy
Google as base translator      428           112   211   6    99   27.5%           75.6%
Microsoft as base translator   428           116   207   11   94   29.6%           75.6%
Google as base translator      52            21    15    0    16   40%             69%
Google as base translator      114           44    49    1    20   38%             81.6%
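The accuracy columns follow directly from the counts: the base system is correct when its first option is correct (CC + CW), while RSVP is correct in the CC and WC cases. A quick check against the first row, agreeing with the printed 27.5%/75.6% to within rounding:

```python
def accuracies(n, cc, wc, cw, ww):
    """Recompute base and RSVP accuracy (in %) from CC/WC/CW/WW counts."""
    assert cc + wc + cw + ww == n
    base = 100 * (cc + cw) / n   # base system's first option was correct
    rsvp = 100 * (cc + wc) / n   # RSVP's final output was correct
    return base, rsvp

base, rsvp = accuracies(428, 112, 211, 6, 99)  # first row of the table
assert abs(base - 27.5) < 0.2 and abs(rsvp - 75.6) < 0.2
```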
Experiment 1. Non-native speaker, male. Here, the speaker chose only easy-to-read questions from the 300-question Microsoft test set; the following are examples where RSVP corrected Google's errors:

Original question: "How many types of birds are found in North America?" Google speech-recognition output: "How many pints of birds are formed in North America?" "How many pints of birds are found in North America?" "How many pints of birds performed in North America?" RSVP output: "How many types of birds are found in North America?"

Original question: "How does the atom bomb work?" Google speech-recognition output: "call does the atom bomb work?" "All does the atom bomb work?" "aha does the atom bomb work?" RSVP output: "How does the atom bomb work?"

Original question: "How holidays are celebrated around the world?" Google speech-recognition output: "call holidays are celebrated around in the wall?" "call holidays are celebrated around in the world?" "how all holidays are celebrated around in the wall?" RSVP output: "how holidays are celebrated around in the world?"

Original question: "Are there any known aliens?" Google speech-recognition output: "are there any loans deviance?" "are there any loans aliens?" "are there any known deviance?" RSVP output: "Are there any known aliens?"

Original question: "What does a frog eat?" Google speech-recognition output: "what is front seat?" "what is the frogs eat?" "what does the front seat?" RSVP output: "What does frogs eat?"

Experiment 2. Native speaker, male;

Experiment 3. Native speaker, male;

Experiment 4. Non-native speaker, male;

Experiment 5. Non-native speaker, female;

Experiment 6. Non-native speaker, female;

Experiment 7. Three native English-speaking children, 100 questions each: eight years old, female; nine years old, male; and 11 years old, male. In principle, we prefer independent tests with one individual responsible for the complete set of 300 questions. However, we were only able to get each of the children to do 100 questions, skipping the difficult ones. The result is similar to that of adult native English speakers;

Experiment 8. Native English speaker, male;

Experiment 9. Non-native English speaker, male;

Experiment 10. Non-native English speaker, female, 11 years old, in Canada to attend summer camp to learn English; her English was rudimentary, and consequently she was able to read only 57 of the 300 questions;

Experiment 11. Non-native English speaker, female; and

Experiment 12. Non-native English speaker, female.
In our experiments (Table 2), the non-native speakers and the children selected relatively easy-to-read questions (without, say, difficult-to-pronounce names) from T to do the tests. The ratio of improvement was better for the non-native speakers, reducing the number of errors (WW column) by 30% on average for experiments 1, 4, 5, 6, 9, 10, 11, and 12. For native speakers, RSVP also delivered a clear advantage, reducing the number of errors (WW column) by 16% on average for experiments 2, 3, 7, and 8. Such an advantage would be amplified in a noisy real-life environment. Allowing the speaker to repeat the question would increase the success rate, as in the following example (with Google): RSVP generated "How many toes does Mary Monroe have?" for the first query and "How many titles does Marilyn Monroe have?" for the second query. Combining the two questions, RSVP generated the correct intended question, "How many toes does Marilyn Monroe have?"
Translation examples.

Google translation: "Fly from Toronto to Beijing long?" Our translation: "How long does it take to fly from Toronto to Beijing?"

Google translation: "People who have a few bones?" Our translation: "How many bones do people have?"

Google translation: "Taiwan's population size?" Our translation: "What is the population size of Taiwan?"

Google translation: "When did the dinosaurs extinct?" Our translation: "When did the dinosaurs become extinct?"
Table 2. Experimental results for speech correction.

Experiment   Total # of questions   CC    WC   CW   WW
1            164                    105   39   5    15
2            300                    219   25   6    50
3            300                    222   15   5    58
4            257                    141   41   7    68
5            181                    100   26   4    51
6            214                    125   29   10   50
7            206                    145   19   8    34
8            298                    180   12   4    102
9            131                    77    14   0    40
10           57                     28    4    1    24
11           63                     35    9    1    18
12           107                    62    9    2    34
Translation
To further justify the methodology proposed here, we extend the approach to translation in the QA domain for cross-language search. Here, we use Chinese-English cross-language search as an example, though the methodology works for other languages, too.

A Chinese-speaking person can perform a cross-language search of the English Internet in two ways:

Translate it all. Translate the whole English Internet, including all QA pairs in the English QA community and English (Wikipedia) databases, into Chinese; or

Translate a question. Translate a Chinese question into English, find the answer in English, then translate the answer back to Chinese.

General-purpose translators today perform so poorly that the first option is out of the question. The RSVP methodology enables the second option, which involves two translations: the Chinese question to English, then the English answer back to Chinese. Since the QA answers are usually simple and read by humans, and the database relations can be manually translated, a general-purpose translator is sufficient and sometimes not even needed.
The key to this approach is translating Chinese questions into English. We implemented the translation system and cross-language (Chinese-English) search as part of the RSVP QA system through the following steps:

Translate a Chinese question into English through a general-purpose translator;

Apply the (modified) correction procedure described here;

Perform the English QA search; and

Translate the result back into Chinese through a general-purpose translator.
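The four steps can be sketched as a pipeline. Every component below is a toy stand-in (tiny lookup tables built around the dinosaur example from the translation-examples figure), not the authors' translator, corrector, or search engine, and the English answer is an illustrative fact, not a result from the paper:

```python
# Toy stand-ins for the real components (illustrative only).
ZH_EN = {"恐龙什么时候灭绝的?": "When did the dinosaurs extinct?"}
EN_ZH = {"About 66 million years ago.": "大约6600万年前。"}

def machine_translate(text, table):          # steps 1 and 4
    return table[text]

def rsvp_correct(question):                  # step 2 (toy correction)
    fixes = {"When did the dinosaurs extinct?":
             "When did the dinosaurs become extinct?"}
    return fixes.get(question, question)

def english_qa_search(question):             # step 3 (toy QA lookup)
    kb = {"When did the dinosaurs become extinct?":
          "About 66 million years ago."}
    return kb[question]

def cross_language_qa(question_zh):
    q_en = machine_translate(question_zh, ZH_EN)   # Chinese -> English
    q_en = rsvp_correct(q_en)                      # fix the raw translation
    a_en = english_qa_search(q_en)                 # answer in English
    return machine_translate(a_en, EN_ZH)          # English -> Chinese
```

Note the correction step is what makes the toy lookup succeed: the raw translation "When did the dinosaurs extinct?" would miss the knowledge-base entry.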
Table 3 outlines experiments with our translation system, using the notation defined earlier: CC, WC, CW, and WW. The first three experiments used data collected as we developed RSVP; the fourth used an independent dataset of 114 questions on a range of topics (see the accompanying translation examples).
Conclusion
This work would be more effective if it were integrated into speech-recognition software so more voice information could be used. However, it targets dynamic special domains that are so numerous that training them separately would be prohibitive.

In addition to special-domain translation, the RSVP methodology can be used to correct the grammatical errors and spelling mistakes in a normal QA text search, as well as to create an automatic writing assistant for a highly structured domain.

Moreover, tuning improves all systems; for example, if we ask "What is the distance between Toronto and Waterloo bla," then we know the extra "bla" should be removed, as inferred from system knowledge, requiring zero bits to encode such a deletion. The theory allows us to add inferrable rules at no additional cost.
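As a toy illustration of such an inferrable rule (the vocabulary and cost model here are our own assumptions, not RSVP's actual implementation), a deletion the system can infer from its own knowledge contributes nothing to the encoding cost, because the rule itself, rather than the encoding, carries the information:

```python
# Toy inferrable-deletion rule (assumed vocabulary; not the actual RSVP
# implementation). Trailing tokens outside the domain vocabulary are
# deleted at zero encoding cost: the rule, not the bitstring, carries
# the information.

KNOWN_WORDS = {"what", "is", "the", "distance", "between",
               "toronto", "and", "waterloo"}

def strip_inferrable(tokens):
    """Drop trailing out-of-vocabulary tokens at zero cost."""
    cost_bits = 0  # inferrable deletions are free under the rule
    while tokens and tokens[-1] not in KNOWN_WORDS:
        tokens = tokens[:-1]  # rule-driven deletion: 0 bits
    return tokens, cost_bits

query = "what is the distance between toronto and waterloo bla".split()
cleaned, bits = strip_inferrable(query)
print(" ".join(cleaned), bits)
# -> what is the distance between toronto and waterloo 0
```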
Acknowledgments
We thank Li Deng of Microsoft for his advice on speech-recognition software and Leonid Levin of Boston University for discussions and suggestions on alternative formulations. We are grateful to the volunteers who participated in our experiments. We thank the referees and Nicole Keshav for helping us improve the article. We also thank Alan Baklor of Answer.com for his help. This work has been partially supported by Canada's IDRC Research Chair in Information Technology program, NSERC Discovery Grant OGP0046506, the Canada Research Chair program, a CFI Infrastructure grant, an NSERC Collaborative grant, Ontario's Premier's Discovery Award, and the Killam Prize.
References

1. Baker, J.M., Deng, L., Glass, J., Khudanpur, S., Lee, C.H., Morgan, N., and O'Shaughnessy, D. Developments and directions in speech recognition and understanding, Part 1. IEEE Signal Processing Magazine 26, 3 (May 2009), 75–80.
2. Bennett, C.H., Gács, P., Li, M., Vitányi, P., and Zurek, W. Information distance. IEEE Transactions on Information Theory 44, 4 (July 1998), 1407–1423.
3. Bennett, C.H., Li, M., and Ma, B. Chain letters and evolutionary histories. Scientific American 288, 6 (June 2003), 76–81.
4. Chen, X., Francia, B., Li, M., McKinnon, B., and Seker, A. Shared information and program plagiarism detection. IEEE Transactions on Information Theory 50, 7 (July 2004), 1545–1550.
5. Cilibrasi, R., Vitányi, P., and de Wolf, R. Algorithmic clustering of music based on string compression. Computer Music Journal 28, 4 (Winter 2004), 49–67.
6. Cilibrasi, R. and Vitányi, P. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19, 3 (Mar. 2007), 370–383.
7. Cuturi, M. and Vert, J.P. The context-tree kernel for strings. Neural Networks 18, 4 (Oct. 2005), 1111–1123.
8. Fagin, R. and Stockmeyer, L. Relaxing the triangle inequality in pattern matching. International Journal of Computer Vision 28, 3 (1998), 219–231.
9. Hoffart, J., Suchanek, F.M., Berberich, K., and Weikum, G. YAGO2: A spatially and temporally enhanced knowledgebase from Wikipedia. Artificial Intelligence 194 (Jan. 2013), 28–61.
10. Keogh, E., Lonardi, S., and Ratanamahatana, C.A. Towards parameter-free data mining. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, 2004, 206–215.
11. Li, M. Information distance and its applications. International Journal on the Foundations of Computer Science 18, 4 (Aug. 2007), 669–681.
12. Li, M., Badger, J., Chen, X., Kwong, S., Kearney, P., and Zhang, H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 17, 2 (Feb. 2001), 149–154.
13. Li, M., Chen, X., Li, X., Ma, B., and Vitányi, P. The similarity metric. IEEE Transactions on Information Theory 50, 12 (Dec. 2004), 3250–3264.
14. Li, M. and Vitányi, P. An Introduction to Kolmogorov Complexity and Its Applications, Third Edition. Springer-Verlag, New York, 2008.
15. Lieberman, H., Faaborg, A., Daher, W., and Espinosa, J. How to wreck a nice beach you sing calm incense. In Proceedings of the 10th International Conference on Intelligent User Interfaces (2005), 278–280.
16. Lopes, L.R. A software agent for detecting and correcting speech recognition errors using a knowledge base. SATNAC, 2008.
17. Nykter, M., Price, N.D., Larjo, A., Aho, T., Kauffman, S.A., Yli-Harja, O., and Shmulevich, I. Critical networks exhibit maximal information diversity in structure-dynamics relationships. Physical Review Letters 100, 5 (Feb. 2008), 058702-706.
18. Nykter, M., Price, N.D., Aldana, M., Ramsey, S.A., Kauffman, S.A., Hood, L.E., Yli-Harja, O., and Shmulevich, I. Gene expression dynamics in the macrophage exhibit criticality. Proceedings of the National Academy of Sciences 105, 6 (Feb. 2008), 1897–1900.
19. Pao, H.K. and Case, J. Computing entropy for ortholog detection. In Proceedings of the International Conference on Computational Intelligence (Istanbul, Turkey, Dec. 17–19, 2004).
20. Rosenfeld, R. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE 88, 8 (Aug. 2000), 1270–1278.
21. Sarma, A. and Palmer, D.D. Context-based speech recognition error detection and correction. In Proceedings of the Human Language Technology Conference (Boston, May 2–7). Association for Computational Linguistics, Stroudsburg, PA, 2004, 85–88.
22. Veltkamp, R.C. Shape matching: Similarity measures and algorithms. In Proceedings of the International Conference on Shape Modeling Applications (Genoa, Italy, 2001), 188–197.
23. Zhang, X., Hao, Y., Zhu, X.Y., and Li, M. New information measure and its application in question-answering system. Journal of Computer Science and Technology 23, 4 (July 2008), 557–572.
24. Zhang, X., Hao, Y., Zhu, X., and Li, M. Information distance from a question to an answer. In Proceedings of the 13th ACM SIGKDD Conference on Knowledge Discovery in Data Mining (San Jose, CA, Aug. 12–15). ACM Press, New York, 2007, 874–883.
Yang Tang ([email protected]) is a research associate in the David R. Cheriton School of Computer Science at the University of Waterloo, Waterloo, Ontario.

Di Wang ([email protected]) is a graduate student in the David R. Cheriton School of Computer Science at the University of Waterloo, Waterloo, Ontario.

Jing Bai ([email protected]) is a researcher at Microsoft Corporation, Silicon Valley campus, Sunnyvale, CA.

Xiaoyan Zhu ([email protected]) is a professor in the Tsinghua National Laboratory for Information Science and Technology and Department of Computer Science and Technology and director of the State Key Laboratory of Intelligent Technology and Systems at Tsinghua University, Beijing.

Ming Li ([email protected]) is Canada Research Chair in Bioinformatics and a University Professor in the David R. Cheriton School of Computer Science at the University of Waterloo, Waterloo, Ontario.
© 2013 ACM 0001-0782/13/07