13.0 Voice-based Information Retrieval
References:
1. “Speech and Language Technologies for Audio Indexing and Retrieval”, Proceedings of the IEEE, Aug 2000
2. “Discriminating Capabilities of Syllable-based Features and Approaches of Utilizing Them for Voice Retrieval of Speech Information in Mandarin Chinese”, IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, July 2002, pp. 303-314
3. Baeza-Yates & Ribeiro-Neto, “Modern Information Retrieval”, ACM Press, 1999
4. ACM Special Interest Group on Information Retrieval, http://www.acm.org/sigir
5. “A Hidden Markov Model Information Retrieval System”, ACM SIGIR, 1999
6. “Improved Spoken Document Retrieval with Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis (PLSA)”, International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2006
7. “Position Specific Posterior Lattices for Indexing Speech”, 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 2005, pp. 443-450
Voice-enabled Web-based Applications
[Figure: future networks carrying voice information and text information for private and personal services and for public information and services; users interact by voice input/output, supported by voice-based information retrieval, text-to-speech synthesis, and spoken dialogue]
• Network Access is Primarily Text-based today, but almost all Roles of Texts can be Replaced by Voice in the Future
• Human-Network Interactions can be Accomplished by Spoken Dialogues
• Voice-based Information Retrieval needs to be integrated with Spoken Dialogues
• More Multi-media Information including Voice but not including Enough Text will be Available on the Web in the Future
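Voice-based Information Retrieval
[Figure: voice queries and text queries (e.g. “I want to find news about the terrorist attack on New York”) are matched against text information (documents d1, d2, d3) and voice information (documents d1, d2, d3, e.g. “US President Bush this morning…”)]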
•Speech/Text Queries, Speech/Text Documents
•Mobile/Office User Environments with Multi-modality
•Speech Provides Better User Interface in Wireless Environment
Information Retrieval
• Indexing
– Document representation: d
• Query formation
– User request representation: q
• Retrieval
– Matching query to documents
– Returning relevant documents
• Relevance feedback
– Assessing retrieved results
– Modifying initial query
– Iterated retrieval: automatic (blind)/manual
• Performance evaluation
– Performance measure
[Figure: block diagram with Indexing (documents to document representation d), Query Formation (user request to query representation q), Retrieval (list of relevant documents in order), Evaluation (performance), and Feedback]
Performance Measures
• Recall and Precision Rates
– A: documents both retrieved and relevant; B: documents retrieved but not relevant; C: documents relevant but not retrieved
– Precision rate = A / (A + B)
– Recall rate = A / (A + C)
– similar to missing/false alarm rates
– recall-precision plot similar to ROC curves
– recall rate may be difficult to evaluate, while precision rate is directly perceived by users
• Non-Interpolated Average Precision
– averaged at all relevant documents retrieved and over all queries
– e.g. relevant documents ranked at 1, 5, 10: precisions are 1/1, 2/5, 3/10, so non-interpolated average precision = (1/1 + 2/5 + 3/10)/3
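The measures above can be sketched in a few lines of Python. This is a minimal illustration only; the function names and the document identifiers d1 … d10 are assumptions made for the example, not part of the original slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    A = retrieved and relevant, B = retrieved only, C = relevant only.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)
    precision = a / len(retrieved) if retrieved else 0.0  # A / (A + B)
    recall = a / len(relevant) if relevant else 0.0       # A / (A + C)
    return precision, recall


def average_precision(ranked, relevant):
    """Non-interpolated average precision for one ranked list.

    Precision is taken at the rank of each relevant document retrieved,
    then averaged over those documents.
    """
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(runs):
    """Average the per-query values over all queries; runs = [(ranked, relevant), ...]."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# The slide's example: relevant documents ranked at positions 1, 5 and 10.
ranked = ["d%d" % i for i in range(1, 11)]
relevant = {"d1", "d5", "d10"}
ap = average_precision(ranked, relevant)
```

Here `ap` equals (1/1 + 2/5 + 3/10)/3, matching the worked example.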
Approaches to Speech-based Information Retrieval
• Indexing Elements
– Words: Large-vocabulary Based
· create text transcription of spoken documents/queries by speech recognition
· use text retrieval methods
· error propagation, out-of-vocabulary (OOV) problems, special terms
– Subword Units: Subword Based
· subword units: phones/syllables/something similar
· a segment of one to a few subword units may carry some indexing information
· not limited by the vocabulary
· small size/handling some OOV/probably more ambiguity
– Keywords: Keyword Based
· based on a set of keywords
· keyword selection: user-specified/a priori/fixed/automatically generated
· special terms for dynamic documents
– Hybrid: Fusion of Information
• Indexing Features
– a single element
– different combinations of more than one element
– pre-defined, or automatically selected by data-driven approaches
– each of such features is called an “indexing term”
• Retrieval Model Examples
– vector space models
– latent semantic indexing (LSI)
– statistical (probabilistic) models
– hidden Markov model (HMM)
– combinations/hybrid models
Vector Space Model
• Vector Representations of query q and document d
– for each type j of indexing feature a vector is generated
– each component in this vector is the weighted statistics z_{jt} of a specific indexing term t:
$z_{jt} = [1 + \ln(c_t)] \cdot \ln(N / N_t)$
– the first factor is the Term Frequency (TF), the second the Inverse Document Frequency (IDF)
– c_t: frequency counts for the indexing term t present in the query q or document d (for text), or sum of normalized recognition scores or confidence measures for the indexing term t (for speech)
– N: total number of documents in the database
– N_t: total number of documents in the database which include the indexing term t
– IDF: the significance (or importance) or indexing power for the indexing term t
• Relevance Score for Each Type j of Indexing Feature
$R_j(\vec{q}_j, \vec{d}_j) = \frac{\vec{q}_j \cdot \vec{d}_j}{\|\vec{q}_j\| \, \|\vec{d}_j\|}$
– $\vec{q}_j$, $\vec{d}_j$: vector representations for query q and document d with type j of indexing feature
• The Overall Relevance Score is the Weighted Sum of the Relevance Scores for all Types of Indexing Features
$R(q, d) = \sum_j w_j \, R_j(\vec{q}_j, \vec{d}_j)$
– w_j: weighting coefficients
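A minimal sketch of the TF/IDF weighting and cosine relevance score, assuming plain text counts for c_t. The toy documents, the feature-type names "S2" and "P1", and the weights passed to `overall_relevance` are hypothetical values chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vector(counts, doc_freq, n_docs):
    """z_t = [1 + ln(c_t)] * ln(N / N_t) for every term with c_t > 0."""
    vec = {}
    for term, c in counts.items():
        n_t = doc_freq.get(term, 0)
        if c > 0 and n_t > 0:
            vec[term] = (1.0 + math.log(c)) * math.log(n_docs / n_t)
    return vec


def cosine(u, v):
    """Relevance score R_j: normalized inner product of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def overall_relevance(scores_by_type, weights):
    """R(q, d) = sum over feature types j of w_j * R_j."""
    return sum(weights[j] * scores_by_type[j] for j in scores_by_type)


# Toy database (hypothetical).
docs = {
    "d1": ["taipei", "news", "news"],
    "d2": ["new", "york", "news"],
    "d3": ["new", "york", "attack"],
}
n_docs = len(docs)
doc_freq = Counter(t for terms in docs.values() for t in set(terms))

query_vec = tfidf_vector(Counter(["new", "york", "attack"]), doc_freq, n_docs)
scores = {
    name: cosine(query_vec, tfidf_vector(Counter(terms), doc_freq, n_docs))
    for name, terms in docs.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

For speech, c_t would be replaced by a sum of normalized recognition scores or confidence measures, as noted above; the rest of the computation is unchanged.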
Improved Retrieval Technique Examples
• Blind Relevance Feedback
– the information from the relevant and irrelevant documents retrieved in the previous stage is used to identify more helpful indexing terms
– the initial query is reformulated accordingly:
$\vec{q}\,' = \alpha \cdot \vec{q} + \beta \cdot \sum_{\vec{d} \in D_r} \vec{d} - \gamma \cdot \sum_{\vec{d} \in D_{irr}} \vec{d}$
– $\vec{q}$, $\vec{d}$: vector representations for the query and documents
– D_r: selected set of relevant documents retrieved in the previous stage
– D_irr: selected set of irrelevant documents deleted in the previous stage
– $\vec{q}\,'$: new query representation
– α, β, γ: weighting coefficients
• Query Expansion by Term Association
– the indexing terms co-occurring frequently in the same documents are assumed to have some synonymity association
– build an association matrix for each type of the indexing features, in which each entry (i, j) stands for the association between indexing terms t_i and t_j, as an example:
$A(i, j) = \frac{\hat{f}_{i,j}}{f_i + f_j - \hat{f}_{i,j}}, \quad 0 \le A(i, j) \le 1$
– f_i, f_j: number of documents in the database including the indexing terms t_i, t_j
– $\hat{f}_{i,j}$: number of documents in the database including both indexing terms t_i and t_j
– reformulate the query expression by adding indexing terms with higher synonymity
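Both techniques can be sketched with sparse dictionaries. This is an illustrative sketch only: the default values of alpha, beta and gamma are arbitrary assumptions, and the association measure follows the example form with documents represented as sets of indexing terms.

```python
def reformulate_query(q, relevant_docs, irrelevant_docs,
                      alpha=1.0, beta=0.5, gamma=0.25):
    """q' = alpha*q + beta*sum(d in D_r) - gamma*sum(d in D_irr).

    q and every document are sparse vectors {indexing term: weight}.
    The default coefficients are placeholders, not tuned values.
    """
    new_q = {t: alpha * w for t, w in q.items()}
    for d in relevant_docs:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w
    for d in irrelevant_docs:
        for t, w in d.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w
    return new_q


def association(doc_term_sets, t_i, t_j):
    """A(i, j) = f_ij / (f_i + f_j - f_ij), a value between 0 and 1."""
    f_i = sum(1 for d in doc_term_sets if t_i in d)
    f_j = sum(1 for d in doc_term_sets if t_j in d)
    f_ij = sum(1 for d in doc_term_sets if t_i in d and t_j in d)
    denom = f_i + f_j - f_ij
    return f_ij / denom if denom else 0.0


def expand_query(query_terms, doc_term_sets, vocabulary, threshold):
    """Add terms whose association with any query term reaches the threshold."""
    expanded = set(query_terms)
    for t in vocabulary:
        if any(association(doc_term_sets, q, t) >= threshold for q in query_terms):
            expanded.add(t)
    return expanded


# Hypothetical toy data.
new_q = reformulate_query({"x": 1.0}, [{"x": 1.0, "y": 2.0}], [{"z": 1.0}])
doc_sets = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
a_ab = association(doc_sets, "a", "b")
```

In blind (automatic) feedback the relevant set would typically be the top-ranked documents of the first pass; that selection step is left out here.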
Difficulties in Speech-based Information Retrieval for Chinese Language
• Even for Text-based Information Retrieval, Flexible Wording Structure Makes it Difficult to Search by Comparing the Character Strings Alone
– name/title: 李登輝 → 李前總統登輝, 李前主席登輝 (President T.H. Lee)
– arbitrary abbreviation: 北二高 → 北部第二高速公路 (Second Northern Freeway)
– similar phrases: 中華文化 → 中國文化 (Chinese culture)
– translated terms: 巴塞隆那 → 巴瑟隆納 (Barcelona)
• Word Segmentation Ambiguity Even for Text-based Information Retrieval
– 腦科 (human brain studies) → 電腦科學 (computer science)
– 土地公 (God of earth) → 土地公有政策 (policy of public sharing of the land)
• Uncertainties in Speech Recognition
– errors (deletion, substitution, insertion)
– out-of-vocabulary (OOV) words, etc.
– very often the key phrases for retrieval are OOV
Syllable-Level Indexing Features for Chinese Language
• A Whole Class of Syllable-Level Indexing Features with Complete Phonological Coverage and Better Discriminating Functions
– Overlapping syllable segments with length N: S(N), N = 1, 2, 3, …
– Syllable pairs separated by M syllables: P(M), M = 1, 2, …
• Character- or Word-Level Features can be Similarly Defined
[Figure: S(N) for N = 1, 2, 3 and P(M) for M = 1, 2 marked on the syllable sequence S1 S2 S3 S4 S5 … S10]
Syllable Pair Separated by M Syllables: Examples
P(M), M=1: (s1 s3) (s2 s4) … (s8 s10)
P(M), M=2: (s1 s4) (s2 s5) … (s7 s10)
P(M), M=3: (s1 s5) (s2 s6) … (s6 s10)
P(M), M=4: (s1 s6) (s2 s7) … (s5 s10)
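The two feature families can be generated mechanically from a syllable sequence. The sketch below assumes the sequence is given as a Python list; the function names are illustrative.

```python
def syllable_segments(syllables, n):
    """S(N): all overlapping segments of n consecutive syllables."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def syllable_pairs(syllables, m):
    """P(M): all pairs of syllables separated by exactly m other syllables."""
    gap = m + 1
    return [(syllables[i], syllables[i + gap])
            for i in range(len(syllables) - gap)]


# The ten-syllable sequence used in the examples above.
s = ["s%d" % i for i in range(1, 11)]
p1 = syllable_pairs(s, 1)   # starts with (s1, s3), ends with (s8, s10)
p4 = syllable_pairs(s, 4)   # starts with (s1, s6), ends with (s5, s10)
s2 = syllable_segments(s, 2)
```

The same functions apply unchanged to character or word sequences, which is how the character- and word-level features can be defined similarly.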
Syllable-Level Statistical Features
• Single Syllables
– each syllable is usually shared by more than one character with different meanings, thus causing ambiguity
– all words are composed of syllables, thus partially handling the OOV problem
– very often relevant words have some syllables in common
• Overlapping Syllable Segments with Length N
– capturing the information of polysyllabic words or phrases with flexible wording structures
– majority of Chinese words are bi-syllabic
– not too many polysyllabic words share the same pronunciation
• Syllable Pairs Separated by M Syllables
– tackling the problems arising from the flexible wording structure, abbreviations, and deletion, insertion, substitution errors in speech recognition
Improved Syllable-level Indexing Features
• Syllable Lattice and Syllable-level Utterance Verification
– including multiple syllable hypotheses to construct syllable-aligned lattices for both query and documents
– generating multiple syllable-level indexing features from syllable lattices
– filtering out indexing terms with lower acoustic confidence scores
• Infrequent Term Deletion (ITD)
– syllable-level statistics trained with a text corpus are used to prune infrequent indexing terms
• Stop Terms (ST)
– indexing terms with the lowest IDF scores are taken as the stop terms
[Figure legend: syllables with higher acoustic confidence scores; syllables with lower acoustic confidence scores; syllable pairs S(N), N=2 pruned by ITD; syllable pairs S(N), N=2 pruned by ST]
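The three pruning steps can be sketched as simple filters. The thresholds, the cut-off k, and the romanized syllables in the example are hypothetical; the slides do not specify how these values are set.

```python
import math


def filter_by_confidence(scored_terms, threshold):
    """Keep indexing terms whose acoustic confidence score reaches the threshold."""
    return [term for term, score in scored_terms if score >= threshold]


def infrequent_term_deletion(terms, corpus_counts, min_count):
    """ITD: drop terms that are rare in a text corpus (likely recognition noise)."""
    return [t for t in terms if corpus_counts.get(t, 0) >= min_count]


def select_stop_terms(doc_freq, n_docs, k):
    """ST: the k indexing terms with the lowest IDF = ln(N / N_t)."""
    idf = {t: math.log(n_docs / n_t) for t, n_t in doc_freq.items() if n_t > 0}
    return set(sorted(idf, key=lambda t: (idf[t], t))[:k])


# Hypothetical syllable-pair terms S(N), N=2, with confidence scores.
scored = [(("niu", "yue"), 0.9), (("yue", "gong"), 0.2), (("de", "shi"), 0.8)]
confident = filter_by_confidence(scored, 0.5)
kept = infrequent_term_deletion(
    confident, {("niu", "yue"): 50, ("de", "shi"): 400}, min_count=5)
stop = select_stop_terms(
    {("de", "shi"): 95, ("niu", "yue"): 3, ("gong", "ji"): 4}, n_docs=100, k=1)
final_terms = [t for t in kept if t not in stop]
```

The order of the filters is a design choice in this sketch; confidence filtering comes first only because it shrinks the set the later steps have to examine.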
Hidden Markov Model (HMM) for Speech-based Information Retrieval
• Modeling the Query q as a Sequence of Input Observations (Indexing Terms), q = t_1 t_2 … t_n … t_N, and each Document d as an HMM (1-state at the moment) Composed of Distributions of N-gram Parameters
• MAP Principle (as a simple example)
$d^* = \arg\max_d \mathrm{Prob}(d \text{ is } R \mid q) = \arg\max_d \mathrm{Prob}(q \mid d \text{ is } R)\,\mathrm{Prob}(d \text{ is } R) \approx \arg\max_d \mathrm{Prob}(q \mid d \text{ is } R)$
– q: input query, d: all documents in the database, “is R”: is relevant
– reduced to maximum likelihood without prior knowledge
• Observation Probability in the HMM state (as a simple example)
$P(q \mid d \text{ is } R) = [m_1 P(t_1 \mid d) + m_2 P(t_1 \mid C)] \cdot \prod_{n=2}^{N} [m_1 P(t_n \mid d) + m_2 P(t_n \mid C) + m_3 P(t_n \mid t_{n-1}, d) + m_4 P(t_n \mid t_{n-1}, C)]$
– P(t_n|d), P(t_n|t_{n-1}, d): unigram/bigram trained from the document d
– P(t_n|C), P(t_n|t_{n-1}, C): unigram/bigram trained from a large corpus C, especially helpful for missing terms in the documents
– m_1, m_2, m_3, m_4 trained by EM/MCE, with m_1 + m_2 + m_3 + m_4 = 1
[Figure: a single HMM state for q = t_1 t_2 … t_n … t_N mixing p(t_n|d), p(t_n|C), p(t_n|t_{n-1}, d), and p(t_n|t_{n-1}, C) with weights m_1, m_2, m_3, m_4]
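A minimal sketch of the query likelihood above, computed in the log domain. The mixture weights are fixed by hand here purely for illustration (the slides train them by EM/MCE), the n-gram estimates are plain relative frequencies, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def ngram_models(terms):
    """Relative-frequency unigram and bigram estimates from a term sequence."""
    uni = Counter(terms)
    bi = Counter(zip(terms, terms[1:]))
    total = len(terms)

    def p_uni(t):
        return uni[t] / total if total else 0.0

    def p_bi(t, prev):
        return bi[(prev, t)] / uni[prev] if uni[prev] else 0.0

    return p_uni, p_bi


def log_query_likelihood(query, doc_terms, corpus_terms, m=(0.4, 0.3, 0.2, 0.1)):
    """log P(q | d is R) with the four-way mixture; m = (m1, m2, m3, m4)."""
    m1, m2, m3, m4 = m
    d_uni, d_bi = ngram_models(doc_terms)
    c_uni, c_bi = ngram_models(corpus_terms)
    logp = 0.0
    for n, t in enumerate(query):
        p = m1 * d_uni(t) + m2 * c_uni(t)          # unigram part (all n)
        if n > 0:                                   # bigram part (n >= 2)
            prev = query[n - 1]
            p += m3 * d_bi(t, prev) + m4 * c_bi(t, prev)
        if p <= 0.0:
            return float("-inf")
        logp += math.log(p)
    return logp


def retrieve(query, docs, corpus_terms):
    """Rank documents by likelihood (the maximum-likelihood case, no prior)."""
    scores = {name: log_query_likelihood(query, terms, corpus_terms)
              for name, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


docs = {
    "d1": "new york attack news".split(),
    "d2": "taipei weather news".split(),
}
corpus = docs["d1"] + docs["d2"]   # stand-in for a large corpus C
ranking, scores = retrieve(["new", "york"], docs, corpus)
```

Note how the corpus terms keep the score of d2 finite even though neither query term occurs in it, which is the role the slide assigns to P(t_n|C) for missing terms.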
Latent Semantic Indexing (LSI) Model for Speech-based Information Retrieval
• Term-Document Matrix
– M indexing terms {t_1, t_2, …, t_M} and N documents {d_1, d_2, …, d_N}
– $W = [w_{ij}]_{M \times N}$
– $w_{ij} = l_{ij} \cdot g_i$, l_ij: local weight, g_i: global weight
– e.g. $w_{ij} = (1 - \varepsilon_i)(c_{ij} / n_j)$, normalized with document length and term entropy, or $w_{ij} = [1 + \ln(c_{ij})] \cdot \ln(N / N_i)$ (TF/IDF)
• Singular Value Decomposition (SVD)
– $W \approx \hat{W} = U S V^T$, S: diagonal with singular values
– $\bar{u}_i = u_i S$: term vector; $\bar{v}_j = v_j S$: document vector
– reduced to R-dimensional space of “latent semantic concepts”
• Query q considered as a new document “folded-in”
– $\bar{v}_q = d_q^T U$
– relevance score: $R(q, d) = \frac{\bar{v}_q \cdot \bar{v}_d}{\|\bar{v}_q\| \, \|\bar{v}_d\|}$
[Figure: term matching contrasted with concept matching]
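A short derivation, using only the symbols defined above, may help show why folding in the query as $d_q^T U$ places it in the same space as the document vectors. It assumes $v_j$ denotes the j-th row of V and that the columns of U are orthonormal, as they are in an SVD.

```latex
\begin{align*}
W &= U S V^{T} \\
d_j &= U S v_j^{T}
  && \text{(column $j$ of $W$, i.e. document $d_j$ as a term vector)} \\
U^{T} d_j &= S v_j^{T}
  && \text{(since $U^{T} U = I$)} \\
d_j^{T} U &= v_j S = \bar{v}_j
  && \text{(transpose; $S$ is diagonal)} \\
\bar{v}_q &= d_q^{T} U
  && \text{(the same mapping applied to the query term vector $d_q$)}
\end{align*}
```

So a document already in the database maps to exactly its own document vector, and a query is treated as one more column of W.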
Speech-based Information Retrieval by Keywords: An Example
• Automatic Keyword Extraction from Texts integrated with Keyword Spotting
• Integration with Other Approaches
[Figure: block diagram with Speech Documents, Speech Recognition (transcription of speech documents), Text Documents, Automatic Keyword Extraction (extracted keywords), Keyword Set, Indexing File, Keyword Spotting (spotted keywords from the input speech query), Keyword-based Retrieval, and Retrieved Text/Speech Documents]
Voice-based Information Retrieval
— how far are we from the text-based information retrieval ?
Lin-shan Lee
National Taiwan University
Taipei, Taiwan, ROC
Introduction:
Voice-based Information Retrieval
Text/Voice-based Information Retrieval
• Text-based Information Retrieval Extremely Successful
– information desired by the users can be obtained efficiently in real time
– all users like it
– producing very successful applications and industry
• All Roles of Texts can be Accomplished by Voice
– spoken information or multimedia information with voice in audio part
– voice instructions/queries via handheld devices
• How about Voice-based Information Retrieval?
[Figure: user sends instructions/queries over the Internet to servers holding documents/information]
Voice-based Information Retrieval (1/2)
[Figure: voice instructions/queries and text instructions/queries (e.g. “Newly elected president of US?”) retrieving text information and voice information (multimedia including audio part, e.g. “Barack Obama…”)]
• If Voice Documents/Queries can be Accurately Recognized
– voice-based reduced to text-based information retrieval
• Correct but Never Possible
Voice-based Information Retrieval (2/2)
[Figure: same as in (1/2)]
• User Instructions and/or Network Content Can be in the Form of Voice
– text queries/spoken documents
– spoken queries/text documents
– spoken queries/spoken documents
Text Queries/Spoken Documents
• Spoken Document Retrieval
– started with longer documents/queries at relatively higher ASR accuracies
– started with text-based approaches applied on 1-best transcriptions
– inadequate for short documents/queries with relatively poor ASR accuracies
• Spoken Term Detection
– emerged probably from the successful term matching paradigm for text-based approaches
– considering multiple alternatives from ASR output (e.g. lattices) to handle ASR errors
– different from the traditional task of Keyword Spotting in that the query set is open
[Chelba, Hazen, Saraclar, IEEE SPM 08][Vergyri, et al, Interspeech 07]
[Saraclar & Sproat, HLT 04][Mamou, et al, SIGIR 06][Chelba & Acero, ACL 05]
Spoken Queries/Text Documents
• Voice Search
– information to be retrieved existing in a large text database (e.g. directory assistance)
– out-of-vocabulary (OOV) words in the database
– disambiguated by dialogues
[Figure: voice search loop with user, ASR (n-best results), search (query to database), disambiguation, and dialogue manager]
[Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08]
[Yu, Wang, Acero, Interspeech 07]
• Spoken Query Processing
– using a lattice of possible terms as the queries
– more semantic analysis performed during retrieval
[Moreno-Daniel, Juang, Wilpon, ICASSP 07, Interspeech 08]
Spoken Queries/Spoken Documents
• Uncertainty on Both Sides
• Query-by-example [Chia, et al, SIGIR 08]
• Comparing Two Lattices of Queries/Documents by Graphical Model [Lin et al, Interspeech 08]
Wireless and Multimedia Technologies are Creating an Environment of Voice-based Information Retrieval
[Figure: users exchange voice input/output with the Internet, which holds text content and multimedia content (text information and voice information); the technologies shown are text information retrieval, voice-based information retrieval, multimedia content analysis, text-to-speech synthesis, and spoken and multi-modal dialogue]
• Many Hand-held Devices with Multimedia Functionalities Commercially Available Today
• Unlimited Quantities of Multimedia Content Available over the Internet
• User-Content Interaction Necessary for Information Retrieval can be Accomplished by Spoken and Multi-modal Dialogues
• Network Access is Primarily Text-based Today, but almost all Roles of Texts can be Accomplished by Voice
Why Is Text-based Information Retrieval Useful and Attractive? How about Voice-based Information Retrieval?
• Resources
– Text-based: rich resources, with huge quantities of text documents available over the Internet; quantity continues to increase exponentially due to convenient access
– Voice-based: spoken/multimedia content is the new trend; can be realized even sooner given mature technologies
• Accuracy
– Text-based: retrieval accuracy acceptable to users; retrieved documents properly ranked and filtered
– Voice-based: problems with speech recognition errors, especially for spontaneous speech under adverse environments
• User-System Interaction
– Text-based: retrieved documents easily summarized on-screen, thus easily scanned and selected by user; users may easily select query terms suggested for next iteration retrieval in an interactive process
– Voice-based: spoken/multimedia documents not easily summarized on-screen, thus difficult to scan and select; lacks efficient user-system interaction
Accuracy for Voice-based Information Retrieval
• Low Recognition Accuracies for Spontaneous Speech including Out-of-Vocabulary (OOV) Words under Adverse Environment
• Considering Lattices with Multiple Alternatives rather than 1-best Output
– higher probability of including correct words, but also including more noisy words
– correct words may still be excluded (OOV and others)
– huge memory and computation requirements
• Other Approaches: confusion matrix, fuzzy matching…
[Figure: word lattice from start node to end node along the time index; Wi: word hypotheses W1 … W10]
[Mamou & Ramabhadran, Interspeech 08]
Efficient Forms of Lattices for Indexing Purposes – Indexing Structures
• Lattices
[Figure: word lattice from start node to end node along the time index, with word hypotheses W1 … W10]
• An Example of Indexing Structure
[Figure: the lattice rearranged into an indexing structure whose entries pair each word hypothesis with a probability, (W1, p1) … (W10, p10)]
– reduced memory and computation requirements (still huge…)
– added possible paths
– noisy words discriminated by posterior probabilities or similar scores
– n-grams matched and accumulated for all possible n
Examples of Indexing Structures
• Position Specific Posterior Lattices (PSPL)[Chelba & Acero, ACL 05]
• Confusion Networks (CN)[Mamou, et al, SIGIR 06][Hori, Hazen, Glass, ICASSP 07]
• Time-based Merging for Indexing (TMI)[Zhou, Chelba, Seide, HLT 06][Seide, et al, ASRU 07]
• Time-anchored Lattice Expansion (TALE)[Seide, et al, ASRU 07]
• WFST
– directly compile the lattice into a weighted finite state transducer
[Allauzen, et al, HLT 04][Saraclar & Sproat, HLT 04]
Two Examples of Indexing Structures: Position Specific Posterior Lattices (PSPL), Confusion Networks (CN)
• PSPL: locating a word in a segment according to the order of the word in a path
• CN: clustering several words in a segment according to similar time spans and word pronunciation
[Figure: the lattice (start node to end node along the time index, word hypotheses W1 … W10) and all its paths (W1W2, W3W4W5, W6W8W9W10, …), together with the resulting PSPL structure and CN structure; each structure has segments 1 to 4, and each segment holds word hypotheses with their probabilities (Wi: prob)]
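The PSPL idea of placing each word by its order in a path can be sketched from an explicit list of paths. This is a simplification: a real system accumulates posteriors over the lattice without enumerating paths, the path probabilities below are hypothetical, and `phrase_score` is a simple illustrative heuristic rather than the scoring used in the cited work.

```python
from collections import defaultdict


def build_pspl(paths):
    """Accumulate posterior P(word at position k) over all paths.

    paths: list of (word_sequence, path_probability).
    Returns {position (1-based segment): {word: posterior}}.
    """
    total = sum(prob for _, prob in paths)
    pspl = defaultdict(lambda: defaultdict(float))
    for words, prob in paths:
        for position, word in enumerate(words, start=1):
            pspl[position][word] += prob / total
    return {pos: dict(words) for pos, words in pspl.items()}


def phrase_score(pspl, phrase):
    """Heuristic: sum over start positions of the product of the posteriors
    of the phrase words at consecutive positions."""
    score = 0.0
    for start in pspl:
        p = 1.0
        for offset, word in enumerate(phrase):
            p *= pspl.get(start + offset, {}).get(word, 0.0)
        score += p
    return score


# Paths named as in the figure; the probabilities are made up.
paths = [
    (["W1", "W2"], 0.5),
    (["W3", "W4", "W5"], 0.3),
    (["W6", "W8", "W9", "W10"], 0.2),
]
pspl = build_pspl(paths)
```

A confusion network would instead group the words by overlapping time spans, which needs the arc timings that this path-only sketch does not carry.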
OOV or Rare Words Handled by Subword Units
• OOV Word W = w1w2w3w4
– wi: subword units (phonemes, syllables, …)
– a, b, c, d, e: other subword units
• W cannot be Recognized and never Appears in the Lattice
– W cannot be found at word level
– W = w1w2w3w4 hidden at subword level
– can be matched at subword level without being recognized
• Subword-based PSPL (S-PSPL) or CN (S-CN), for Example
[Figure: lattice along the time index whose word arcs, written as subword strings, include aw1w2, w1w2, w2w3, w3w4b, w3w4e, and bcd]
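A small sketch of matching an OOV word at the subword level. It assumes each lattice path is available as a list of recognized in-vocabulary words, each spelled as a list of subword units; the two paths below are hypothetical combinations of the arcs shown in the figure.

```python
def contains_sequence(units, pattern):
    """True if pattern occurs as a contiguous run inside units."""
    n, m = len(units), len(pattern)
    return any(units[i:i + m] == pattern for i in range(n - m + 1))


def find_oov(paths, query_units):
    """Return indices of paths whose flattened subword string contains the query.

    paths: list of paths; each path is a list of words; each word is a list
    of subword units. Word boundaries are ignored when matching.
    """
    hits = []
    for index, path in enumerate(paths):
        flat = [unit for word in path for unit in word]
        if contains_sequence(flat, query_units):
            hits.append(index)
    return hits


oov_word = ["w1", "w2", "w3", "w4"]
paths = [
    [["a", "w1", "w2"], ["w3", "w4", "b"]],   # W hidden across two word arcs
    [["w1", "w2"], ["b", "c", "d"]],          # only the first half of W
]
matches = find_oov(paths, oov_word)
```

No single arc contains W, yet the first path matches once word boundaries are dropped, which is the point of indexing at the subword level.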
Subword-based Indexing Structures (1/2)
• Constructed from Phone Lattices (assuming the subword unit is the phone) from Phone Decoder
– relatively higher phone error rates
[Ng, MIT 00][Wallace, et al, Interspeech 07]
• Word Lattices Represented by Subword Arcs
– only sub-strings of subword units for in-vocabulary words can be generated
[Figure: lattice along the time index in which each word arc Wi (i = 1 … 10) is replaced by its subword arcs wi_1, wi_2, …]
[Saraclar & Sproat, HLT 04][Vergyri, et al, Interspeech 07]
Subword-based Indexing Structures (2/2)
• Subword-based PSPL and CN (S-PSPL, S-CN)
– strings of subword units are not constrained by in-vocabulary words any longer
[Figure: S-PSPL structure and S-CN structure, each with segments 1 to 8 holding subword hypotheses with probabilities (e.g. w1_1: prob, w1_2: prob, …)]
[Pan & Lee, Interspeech 07][Pan & Lee, ASRU 07]
• Hybrid Word-based and Subword-based Structures [Yu & Seide, HLT 05]
Frequently Used Subword Units – Language Dependent (1/2)
• Phonemes
– English and many alphabetic languages
– phone n-grams
– particles: groups of phonemes obtained in a data-driven way
[Ng, MIT 00][Wallace, et al, Interspeech 07][Logan, et al, IEEE T. Multimedia 05]
• Graphemes [Wang & King, ICASSP 08]
• Graphones [Bisani & Ney, Interspeech 05][Akbacak & Vergyri, ICASSP 08]
• Morphs
– morph-based languages: Finnish, Turkish, etc.
– morpheme-like units
[Turunen & Kurimo, SIGIR 07][Parlak & Saraclar, ICASSP 08]
Frequently Used Subword Units – Language Dependent (2/2)
• Phonetic Word Fragments
– derived bottom-up in a data-driven way
[Yu & Seide, HLT 05]
• Syllables/Characters
– Mandarin Chinese and similar monosyllable-based languages
– syllable/character n-grams
– syllable/character pair separated by a syllable/character
[Chen & Lee, IEEE T. SAP 02][Pan & Lee, ASRU 07]
[Meng & Seide, ASRU 07, Interspeech 08]
[Shao & Seide, Interspeech 08]
User-System Interaction for Voice-based Information Retrieval
Issues in User-System Interaction: Difficulties in Browsing, Scanning, and Selecting Multimedia/Spoken Documents
• Text Documents (including those for voice search, etc.) are Better Structured and Easier to Browse
– in paragraphs with titles, or in well structured databases
– easily summarized on-screen
– easily scanned and selected by user
• Multimedia/Spoken Documents are just Video/Audio Signals
– not easily summarized on-screen
– difficult to scan and select
– lacks efficient scenario for user-system interaction
Proposed Approach: Multimedia/Spoken Document Understanding and Organization for Multi-modal User Interfaces
• Semantic Analysis for Spoken Documents
– analyzing the semantic content of the spoken documents
• Key Term Extraction from Multimedia/Spoken Documents
– key terms are very often out-of-vocabulary (OOV) words such as person/organization/location names
• Multimedia/Spoken Document Segmentation
– automatically segmenting a spoken document into short paragraphs, each with a central topic
• Summarization and Title Generation for Multimedia/Spoken Documents
– automatically generating a summary and a title (in text or speech form) for each short paragraph
• Topic Analysis and Organization for Multimedia/Spoken Documents
– analyzing the subject topics for the short paragraphs and organizing them into graphic structures
[Lee & Chen, IEEE SPM 05][Lee, et al, Interspeech 06]
An Example Approach of Semantic Analysis for Spoken Documents: Probabilistic Latent Semantic Analysis (PLSA)
• Creating a Set of Latent Topics between a Set of Terms and a Set of Documents
• Modeling the Relationships by Probabilistic Models Trained with the EM Algorithm
Key Term Extraction (1/2)
• Latent Topic Entropy
[Figure: two example terms, one that carries less topical information and one that carries more topical information]
[Kong & Lee, ICASSP 06][Hsieh & Lee, ICASSP 06]
Key Term Extraction (2/2)
• Latent Topic Significance
– of a term tj with respect to a topic Tk
– P(Tk|Di): how much each document Di is focused on the topic Tk
– [1-P(Tk|Di)]: the probability that each document Di addresses all other topics different from Tk
[Kong & Lee, ICASSP 06]
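The two key-term measures can be sketched as follows. Both formulas are assumptions made for illustration, since the slides show them only as figures: the entropy is taken over an assumed topic distribution P(Tk|tj) of the term, and the significance is taken as a count-weighted ratio of P(Tk|Di) to [1-P(Tk|Di)], with n(tj, Di) the count of term tj in document Di.

```python
import math


def latent_topic_entropy(p_topic_given_term):
    """Entropy of P(T_k | t_j) over topics (assumed form).

    A low value means the term concentrates on few topics, so it carries
    more topical information; a high value means less.
    """
    return -sum(p * math.log(p) for p in p_topic_given_term if p > 0.0)


def latent_topic_significance(term_counts, p_topic_given_doc):
    """Assumed form: sum_i n(t_j, D_i) P(T_k | D_i)
    divided by sum_i n(t_j, D_i) [1 - P(T_k | D_i)].

    term_counts[i] is n(t_j, D_i); p_topic_given_doc[i] is P(T_k | D_i).
    """
    on_topic = sum(n * p for n, p in zip(term_counts, p_topic_given_doc))
    off_topic = sum(n * (1.0 - p) for n, p in zip(term_counts, p_topic_given_doc))
    return on_topic / off_topic if off_topic > 0.0 else float("inf")


# Hypothetical values for two terms over four latent topics.
focused = latent_topic_entropy([0.97, 0.01, 0.01, 0.01])
spread = latent_topic_entropy([0.25, 0.25, 0.25, 0.25])
sig = latent_topic_significance([3, 0], [0.9, 0.1])
```

Under these assumptions, key-term candidates would be terms with low entropy and high significance for some topic.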
Spoken Document Summarization
• Selecting Important Sentences to be Concatenated into a Summary
– sentence scoring and selection
– given a summarization ratio
• Selected Sentences Collectively Represent Some Concepts Closest to those of the Complete Document
– removing the concepts already mentioned previously
– concepts presented smoothly
[Furui, et al, ICASSP 05, IEEE T. SAP 04][Hirschberg, et al, Interspeech 05]
[Murray, Renals, et al, ACL 05, HLT 06][Kawahara, et al, ICASSP 04]
[Nakagawa, et al, SLT 06][Zhu & Penn, Interspeech 06]
[Fung, et al, ICASSP 08][Kong & Lee, ICASSP 06, SLT 06]
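A minimal greedy sketch of sentence selection under a summarization ratio, with a simple novelty factor standing in for "removing the concepts already mentioned previously". The scoring scheme, the novelty factor, and the example data are assumptions for illustration and do not reproduce any of the cited methods.

```python
def summarize(sentences, scores, ratio):
    """Pick sentences greedily until the length budget is used up.

    sentences: list of token lists; scores: importance per sentence;
    ratio: summary length as a fraction of the document length (in tokens).
    Returns the chosen sentences in their original order.
    """
    budget = ratio * sum(len(s) for s in sentences)
    covered = set()          # terms already present in the summary
    chosen = []
    used = 0
    remaining = set(range(len(sentences)))

    def gain(i):
        terms = set(sentences[i])
        novelty = len(terms - covered) / len(terms) if terms else 0.0
        return scores[i] * novelty

    while remaining:
        best = max(remaining, key=lambda i: (gain(i), -i))
        remaining.discard(best)
        if gain(best) <= 0.0:
            break                         # nothing new left to add
        if used + len(sentences[best]) > budget:
            continue                      # too long for the remaining budget
        chosen.append(best)
        used += len(sentences[best])
        covered |= set(sentences[best])
    return [sentences[i] for i in sorted(chosen)]


# Hypothetical document: the second sentence repeats the first.
sentences = [["a", "b", "c"], ["a", "b", "c"], ["d", "e"], ["f"]]
scores = [0.9, 0.8, 0.5, 0.1]
summary = summarize(sentences, scores, ratio=0.6)
```

Keeping the original sentence order in the output is one simple way to approximate "concepts presented smoothly".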
Title Generation for Spoken Documents
• One example: Delicate Scored Viterbi Search
[Figure: a training corpus yields a Term Ordering Model, a Term Selection Model, and a Title Length Model; the spoken document passes through ASR and automatic summarization to give a summary; the Viterbi algorithm combines the summary and the three models to produce the output title]
[Witbrock & Mittal, SIGIR 99][Jin & Hauptmann, HLT 01]
[Chen & Lee, Interspeech 03][Wang & Lee, SLT 08]
Latent Topic Analysis and Organization for Spoken Documents
• Global Semantic Structuring
– offering a global picture of the semantic structure of the entire archive
• Query-based Local Semantic Structuring
– offering a detailed semantic structure of the relevant documents retrieved by the query
Global Semantic Structuring for the Entire Archive
• Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or as a Multi-layered Map
– documents addressing similar topics grouped in the same cluster
– distance between clusters on the map reflects the relationships between the topics of the documents
– a cluster with many documents can be expanded into another map in the next layer
[Figure: two-dimensional tree structure for organized topics]
[Li & Lee, Interspeech 05]
Query-based Local Semantic Structuring for Retrieved Spoken Documents
• User’s Query Produces Many Retrieved Spoken Documents
– difficult to be shown on-screen
• A Topic Hierarchy Constructed for the Retrieved Documents
– each node represents a cluster of retrieved documents labeled by a key term (or topic)
– user may select or delete the nodes directly
• Better User-System Interaction
[Figure: the user issues a query/instruction through multi-modal dialogue; the retrieval system searches the spoken document archive and returns the retrieved documents organized in a topic hierarchy]
[Pan & Lee, ASRU 05]
Improved Interactive Retrieval of Spoken Documents by Ranking the Key Terms in the Topic Hierarchy
• Query Term Suggestions in Text-based Information Retrieval very helpful
• User-System Interaction for Spoken Document Retrieval
• Properly Ranking the Topics in the Topic Hierarchy
– suggesting important/relevant key terms on the top of the hierarchy
– automatically learned and performed by the dialogue manager
[Figure: user, multi-modal dialogue, query/instruction, retrieval system, spoken document archive, retrieved documents, and topic hierarchy]
[Pan & Lee, Interspeech 06, SLT 06]
User-System Interaction in Spoken Dialogue Systems
• Spoken Dialogue Systems
[Figure: the input speech utterance goes through spoken language understanding (ASR producing words/lattices, language understanding producing a semantic frame, dialogue act classification producing the user act); the dialogue manager, with dialogue modeling and a dialogue state, consults a well-organized database and selects the system action; the output generator responds with speech, graphs, and tables]
• Example Goals
– higher task success rate (reliability)
– smaller average number of turns for successful tasks (efficiency)
Dialogue Systems for Voice-based Information Retrieval
• Voice-based Information Retrieval
[Figure: spoken documents d from the multimedia document archive pass through ASR (word/phone lattice, one-best, N-best) and indexing into an inverted index file; the input spoken query q passes through ASR (word/phone lattice, one-best, N-best) to the retrieval engine, which returns related documents; a dialogue manager with dialogue modeling and an internal state drives output presentation through a multi-modal user interface, so that information is obtained by multi-modal interactions (spoken language based information access)]
• Example Goals
– higher task success rate (success: user’s information need satisfied)
– smaller average number of dialogue turns (average number of query terms entered) for successful tasks
• Dialogues Equally Useful in Voice Search for Text Documents
[Pan & Lee, ASRU 07]
[Wang & Acero, IEEE SPM 08][Acero, et al, ICASSP 08]
Concluding Remarks
Voice/text-based Information Retrieval
Accuracy: More Reliable Retrieval Techniques
• Problems
– poor recognition accuracies for spontaneous speech under adverse environments
– serious OOV problem
• Possible Approaches
– lattices and efficient indexing structures
– subword units (covering OOV words, across different languages and using less space)
– methods for reducing computation and memory requirements
– other techniques useful in text-based retrieval: query expansion, semantic concept matching, etc.
[Figure: indexing structure whose entries pair units with probabilities, (u1, p1) … (u19, p19)]
User-System Interaction: More Efficient Interaction Scenario
• Problems
– spoken/multimedia documents not easily summarized on-screen, thus difficult to scan and select
– lacks efficient user-system interactions
– disambiguation by user-system interaction always important even for text documents (e.g. voice search)
• Possible Approaches
– automatic summary/title generation for spoken/multimedia documents
– topic hierarchy construction for retrieved documents, with nodes labeled by key terms
– multi-modal user-system dialogue with improved interaction
[Figure: user, multi-modal dialogue, query/instruction, retrieval system, resource archives, retrieved documents with titles and summaries, and topic hierarchy]