TRANSCRIPT
Interactive Navigation across
Spoken Document Archives with Topic Hierarchies
Constructed by Properly Ranked Key Terms
Speaker: Yi-Cheng Pan (Thomas)
Advisor: Lin-Shan Lee
Introduction
• In the past decade, Google successfully attracted people's attention to Information Retrieval (IR)
  – Huge information need
  – Explosive information supply
• Many other possibilities for IR are now considered
  – Implicit/explicit relevance feedback
  – Clustering of retrieved documents
Introduction
• The most attractive network content will be multimedia
  – As a result, the spoken documents associated with the multimedia content can very possibly serve as the key for retrieval and browsing
• Very substantial research efforts have been made in recent years, and very successful techniques and systems have been developed in the area of Spoken Document Retrieval (SDR)
  – Robust features
  – Efficient utilization of the lattice output of ASR
  – Closing the gap between SDR and text IR
Introduction
• Perhaps we should also begin to think about other possibilities for SDR
Introduction
• There is a gap between a Spoken Document Retrieval (SDR) system and the user
  – The user knows what he needs, but not how to translate it into a good query
  – The system knows exactly which query leads to which set of documents, but needs a good mechanism to probe the user's needs
• A large number of documents, including many irrelevant ones, are retrieved for a given query
Introduction
• Subword-based features used for SDR
  – To bypass the OOV problem
  – To constrain the size of the indexing features
  – Noisy documents retrieved by similar subword-based features, e.g., "to recognize speech" vs. "to wreck a nice beach"
• Retrieved spoken documents are difficult to browse
  – Difficult to show on the screen
Introduction
• Here, we propose to use multi-modal dialogues to help the user navigate across spoken document archives and retrieve the desired documents, based on a topic hierarchy constructed from the key terms extracted from the retrieved spoken documents
  – The key terms in the topic hierarchy are properly ranked

[Figure: the user issues a text/spoken query to the retrieval system over the spoken document archive, and clicks on the nodes of the returned topic hierarchy]
The overall system chart

[System chart: Broadcast News Archives feed Key Term Recognition, Broadcast News Retrieval, Topic Hierarchy Construction (driven by the Ranking Algorithm), and Automatic Generation of Titles and Summaries; the Discourse and Dialogue Manager takes user input (spoken queries or clicks), maintains the Topic Hierarchy, and returns titles, summaries and complete documents. Together these form the Multi-modal Dialogue for Information Navigation and Retrieval]
In the following slides, Named Entities and Broadcast News are taken as examples for the key terms and spoken document archives, respectively.
[The same system chart, with Key Term Recognition instantiated as Named Entity Recognition]
NAMED ENTITY (NE) RECOGNITION FROM BROADCAST NEWS

• Named Entities
  – PER, LOC, ORG
  – Key to the content of the news
  – Many heuristic rules and carefully designed algorithms are available to recognize named entities from spoken documents
• Moreover, Named Entities can serve as one kind of semantic feature for SDR
Named Entity Recognition module
• In this module, two special approaches are proposed
  – One for text documents
    • Incorporates an efficient data structure, the PAT tree
  – One for spoken documents
    • Efficiently utilizes the lattice information and the first special approach
Special approach for NER in text documents
• Traditional NER approaches use the well-known local (internal/external) evidences
• Here we propose to further use global evidences
• The basic idea
  – "Even Language Technologies can't handle such problems"
    • "Even Language Technologies" vs. "Language Technologies"
  – 記者陳永在金門報導 ("Reporter Chen Yong [is] in Kinmen reporting")
    • 陳永 vs. 陳永在 (is the name Chen Yong or Chen Yong-zai?)
Special approach for NER in text documents
• If we can efficiently count, for all substrings in our document collection, their frequencies and their likelihoods of being a word (considering their context)
  – we can incorporate such global evidences with the two traditional local evidences into a better strategy
Efficient access and storage of global evidences: PAT Tree

[Figure: an example PAT tree. Branch nodes store a comparison-bit index CB (here CB = 1, 2, 2, 3, 4); leaf nodes store the keys A:0010, B:0011, C:0110, D:1000, E:1011, F:1110, hanging below the root]
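Since the PAT tree serves here as a compressed store of substring statistics, the statistic it provides can be illustrated with a naive counter. This is a sketch only: the corpus, the function name and the `max_len` cutoff are invented for the example, and a real PAT tree avoids materializing every substring explicitly.

```python
from collections import Counter

def substring_counts(docs, max_len=4):
    """Count every substring of up to max_len characters across a corpus.
    A PAT (Patricia) tree provides the same frequencies in compressed
    form; this naive Counter only illustrates the statistic it stores."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                counts[doc[i:j]] += 1
    return counts

# Toy corpus standing in for the document collection
corpus = ["abcab", "bcab"]
counts = substring_counts(corpus)
```

Frequent substrings with consistent contexts are then candidates for the global word evidence described above.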
Special approach for NER in spoken documents

• Many Named Entities are out-of-vocabulary (OOV) words and cannot be obtained from the ASR output
• We propose to recover these OOV words using the Internet as an external knowledge source
[Flow chart: word arcs from the word graph are AND-ed into queries to the Internet; the relevant text documents returned are indexed with PAT trees to produce possible NEs, which are matched with the low-confidence arcs to form the NE candidate lattice]
Special approach for NER in spoken documents

• Possible NE generation:
  – Word arcs in the word graph with higher confidence measures and higher tf-idf scores are AND-ed as queries to retrieve a set of relevant text documents from the Internet
  – NER is then performed on this set of documents, and the recognized Named Entities (NEs) are pooled as NE candidates
• NE matching with the lattice:
  – Arcs with low confidence scores, and therefore possibly containing OOVs, are matched with the possible NEs. If the similarity is higher than a threshold, the NE is put into the NE candidate lattice
Special approach for NER in spoken documents

• NE matching with the lattice:
  – In order to perform the matching between two phone sequences, we defined a phone similarity matrix, based on both the acoustic distance (from the Mahalanobis distance between the HMM models of the phones) and the pronunciation distance (from the probability that a phone is likely to be pronounced as another phone in the pronunciation model)
• In the generated NE candidate lattice, we perform the baseline NE verification/classification procedure
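A minimal sketch of how such a phone similarity matrix could drive the matching, using a standard global-alignment dynamic program. The gap penalty and the toy two-phone matrix are assumptions for illustration, not the system's actual values.

```python
def sequence_similarity(seq_a, seq_b, phone_sim, gap=-1.0):
    """Global alignment score between two phone sequences using a
    phone-to-phone similarity matrix (dict of dicts). gap is the
    insertion/deletion penalty; matrix values are assumed in [0, 1]."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + phone_sim[seq_a[i - 1]][seq_b[j - 1]]
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

# Toy two-phone similarity matrix (invented values)
sim = {"a": {"a": 1.0, "b": 0.3}, "b": {"a": 0.3, "b": 1.0}}
```

An NE candidate whose alignment score against a low-confidence arc exceeds the threshold would then enter the NE candidate lattice.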
Special approach for NER in spoken documents

• The baseline NE verification/classification procedure
  – Given a sequence of Chinese characters S = s1, ..., sn, there exist many possible word sequences W = w1, ..., wm (each word is composed of one to several characters) with corresponding class sequences C = c1, ..., cm (the classes here are PER, LOC, ORG and all other words in the vocabulary). The purpose here is to find the best word sequence W = w1*, ..., wm* and its corresponding class sequence C = c1*, ..., cm* that maximize P(C, W) as shown below:

    (C*, W*) = argmax_{C,W} P(C, W) = argmax_{C,W} P(W|C) · P(C)
Special approach for NER in spoken documents

• The baseline NE verification/classification procedure

    P(C) ≈ P(c1) · P(c2|c1) · ∏_{i=3}^{m} P(ci | c_{i-2} c_{i-1})

    P(W|C) = P(w1 ... wm | c1 ... cm) ≈ ∏_{i=1}^{m} P(wi | ci)
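The maximization can be sketched by scoring each candidate (W, C) pair in log space under these approximations and taking the argmax over an enumerated hypothesis list. All probability tables and hypotheses below are toy values invented for illustration.

```python
import math

def log_score(words, classes, pw_c, pc1, pc2, pc3):
    """log [P(C)·P(W|C)] under the approximations above.
    pc1[c] = P(c1), pc2[(c1, c2)] = P(c2|c1),
    pc3[(c1, c2, c3)] = P(c3|c1 c2), pw_c[(w, c)] = P(w|c)."""
    s = math.log(pc1[classes[0]])
    if len(classes) > 1:
        s += math.log(pc2[(classes[0], classes[1])])
    for i in range(2, len(classes)):
        s += math.log(pc3[(classes[i - 2], classes[i - 1], classes[i])])
    for w, c in zip(words, classes):
        s += math.log(pw_c[(w, c)])
    return s

def best_hypothesis(hypotheses, pw_c, pc1, pc2, pc3):
    """argmax over candidate (W, C) pairs, as in the formula above."""
    return max(hypotheses,
               key=lambda wc: log_score(wc[0], wc[1], pw_c, pc1, pc2, pc3))

# Toy probability tables (all values invented for illustration)
pc1 = {"PER": 0.2, "O": 0.8}
pc2 = {("PER", "O"): 0.9, ("O", "O"): 0.6}
pc3 = {}
pw_c = {("chen-yong", "PER"): 0.5, ("chen-yong", "O"): 0.01,
        ("reports", "O"): 0.3}
hypotheses = [(["chen-yong", "reports"], ["PER", "O"]),
              (["chen-yong", "reports"], ["O", "O"])]
best = best_hypothesis(hypotheses, pw_c, pc1, pc2, pc3)
```

A real system would search the lattice with dynamic programming rather than enumerate all hypotheses.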
Broadcast News retrieval enhanced by Named Entities

• The NEs recognized from broadcast news are apparently extra indexing features for SDR
• We can generate the NE-document matrix off-line and utilize this matrix in two similar ways:
  – LSA
    • The input query vector is folded in as a pseudo-document in the latent semantic space and compared by cosine similarity with the documents in the same space
  – PLSA
    • P(q|d) is computed based on

      P(q|d) ≈ ∏_i P(ti|d) = ∏_i Σ_k P(ti|Tk) · P(Tk|d)
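The PLSA query likelihood above can be sketched as follows, with toy parameter tables standing in for values a real system would estimate with EM.

```python
def plsa_query_likelihood(query_terms, doc, p_term_given_topic, p_topic_given_doc):
    """P(q|d) ≈ product over query terms of the sum over latent topics
    of P(t|Tk)·P(Tk|d), following the formula above."""
    score = 1.0
    for t in query_terms:
        score *= sum(p_term_given_topic[k].get(t, 0.0) * p_k
                     for k, p_k in p_topic_given_doc[doc].items())
    return score

# Toy PLSA parameters over two latent topics (invented values)
p_t_T = {0: {"iraq": 0.5, "bush": 0.5}, 1: {"taipei": 1.0}}
p_T_d = {"d1": {0: 0.8, 1: 0.2}}
```

Documents are then ranked by this likelihood for the given query.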
Broadcast News retrieval enhanced by Named Entities

• The above two approaches were integrated with a baseline broadcast news retrieval system based on Mandarin syllable-level indexing terms with a vector space model
• For each news story, the baseline system and the LSA and PLSA approaches respectively produce a score for the given query q
• The weighted sum of these scores is then used to select the retrieved news stories
Topic hierarchy construction from the broadcast news

• The hierarchical organization of retrieved text documents to help the user browse through the relevant documents has been well studied
• The extension to spoken documents is not straightforward because of the many recognition errors in the transcriptions
• Here we propose to use the relatively reliable Named Entities recognized with the special approaches above to construct the topic hierarchy
Topic hierarchy construction from the broadcast news

• There are some important reasons to choose NEs rather than other terms or phrases to play this role
  – NEs provide high coverage of the broadcast news (almost all news stories contain Named Entities)
  – NEs have high discriminative ability (i.e., they easily separate news stories addressing different topics) and thus are very useful augmented query terms
• The Hierarchical Agglomerative Clustering and Partitioning algorithm (HAC+P), recently proposed for text documents (Chuang and Chien, 2005), is used here, but performed on the NEs recognized from broadcast news
The HAC+P algorithm

• This algorithm consists of two phases:
  – HAC-based clustering to construct a binary-tree hierarchy
  – A partitioning (P) algorithm to transform the binary-tree hierarchy into a balanced and comprehensive m-ary hierarchy, considering two factors:
    • to have good cluster quality
    • to have a reasonable branching number at each node
The HAC+P algorithm

• HAC-based clustering
  – For each involved NE or key term t, we build a feature vector vt by averaging the vector representations of all news stories including t
  – Bottom-up clustering by average linkage; the similarity measure between two clusters Ci and Cj is defined as

    S(Ci, Cj) = (1 / (|Ci||Cj|)) · Σ_{vt∈Ci} Σ_{vs∈Cj} cosine(vt, vs)
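The average-linkage similarity above can be sketched directly; vectors here are plain Python lists rather than the system's actual story representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_link_similarity(ci, cj):
    """S(Ci, Cj): average pairwise cosine between the term vectors of
    two clusters, matching the average-linkage formula above."""
    return sum(cosine(u, v) for u in ci for v in cj) / (len(ci) * len(cj))
```

At each HAC step the two clusters with the highest S(Ci, Cj) are merged.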
The HAC+P algorithm

• The partitioning (P) algorithm
  – Top-down fashion
  – Select the best cuts iteratively
[Figure: a binary-tree hierarchy with leaves C1..C5 and internal nodes C6..C9; (a) candidate cut levels l = 1..4 across the tree, and (b) the flattened hierarchy after a cut is applied]
The partitioning (P) algorithm

• Select the best cuts
• Two factors are considered
  – Cluster quality
    • Each cut contributes a set of clusters H, and we may calculate the quality of H as

      Q(H) = (1/|H|) · Σ_{Ci∈H} S(Ci, C̄i) / S(Ci, Ci),  where C̄i = ∪_{k≠i} Ck is the complement of Ci

  – Number preference

      f(m) = (1 / (α! β^α)) · m^(α-1) · e^(-m/β),  where m is the size of H
The partitioning (P) algorithm

• The final score combines the two

    η = Q(H) · f(m)
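Assuming the final score η = Q(H)·f(m), the size-preference term and the combination can be sketched as below. Q(H) is passed in as a precomputed number, and the α, β defaults are illustrative only, not the thesis's tuned values.

```python
import math

def size_preference(m, alpha=3, beta=2.0):
    """f(m) = m^(alpha-1)·e^(-m/beta) / (alpha!·beta^alpha): a
    Gamma-shaped preference over the branching number m, favoring
    moderate branching over degenerate (1) or very wide cuts."""
    return (m ** (alpha - 1)) * math.exp(-m / beta) / (
        math.factorial(alpha) * beta ** alpha)

def cut_score(quality, m, alpha=3, beta=2.0):
    """η = Q(H)·f(m): combine cluster quality with the size preference;
    the partitioning step compares η across candidate cuts."""
    return quality * size_preference(m, alpha, beta)
```

With these defaults f(m) peaks near m = 4, so a cut producing about four children is preferred when cluster quality is equal.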
The HAC+P algorithm

• Cluster naming
  – Choose the non-duplicate NE with the highest tf-idf score as the name
[Example topic hierarchy over NEs such as 布希 (George Bush), 白宮 (White House), 鮑威爾 (Powell), 聯合國 (United Nations), 伊拉克 (Iraq), 以色列 (Israel), 巴勒斯坦 (Palestine), ...]
Dialogue Manager
• The purpose of the dialogue manager is to enhance the task success rate and task efficiency
• The interaction between the user and system is based on the topic hierarchy tree
• Given the user’s query context, the dialogue manager should propose smart hierarchy trees during the task
Dialogue Manager
• We propose a ranking algorithm that learns possible user behaviors based on reinforcement learning and ranks the topic hierarchies properly
Reinforcement Learning and Markov Decision Process (MDP)

• An MDP can be described as a tuple <S, A, T, R>, where
  – S is a finite set of states
    • We define a state as the currently retrieved documents under the current sequence of queries
  – A is a finite set of actions
    • We define the possible actions as the different key terms proposed to the user in the topic hierarchy for further clarification
Key Term Space and Document Space

[Figure: key terms ti, tj, tk, tl in the key term space map to document sets C(ti), C(tj), C(tk) in the archive space; states s1 = [ti], s2 = [ti, tj], s3 = [ti, tk], sn = [ti, tj, tl] correspond to document groups G1 = C(ti), G2 = C(ti) ∪ C(tj), G3 = C(ti) ∪ C(tk), Gn = C(ti) ∪ C(tj) ∪ C(tl); the mapping is defined by some IR function (e.g., PLSA)]
• A tree of states is generated for each user given the initial query s1
• The user is assumed satisfied (represented with double circles) when the recall rate L/|D| > τ0
  – L: number of relevant documents appearing in the top K retrieved documents
  – D: desired document set; τ0: a predefined threshold
• m(s): minimum number of steps or queries to arrive at a final state

State Transition Diagram

[Figure: a state transition tree rooted at s1 with states s2 through s15; example scores m(s4) = 2, m(s7) = 3, m(s9) = 3, m(s12) = 4, m(s13) = 4, m(s15) = 5]
Simulated Users

• The key terms are ranked using a large number of simulated users
• Each simulated user is characterized by a desired document set D and an initial query s1
  – D is automatically generated with a randomly determined size, by clustering the document archive and randomly selecting key terms covering small subsets of the clusters
• Each simulated user produces a tree of states
• Each state on the tree is scored by the minimum number of steps or queries needed to arrive at a final state (user satisfied) when going through that state
Ranking Algorithm

• With the same initial query, the basic tree structure is fixed for all simulated users
  – Different simulated users with different desired document sets D truncate the same tree at different levels
• With a huge number of simulated users, an averaged tree is trained for each initial query (which is a key term)
• The averaged scores of the states of the averaged tree give the ranking of the key terms
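The averaging step can be sketched as follows, under the simplifying assumption that each simulated run is summarized by a pair (initial key term, steps until the user was satisfied); the log data below is invented for illustration.

```python
from collections import defaultdict

def rank_key_terms(simulation_logs):
    """Rank key terms (used as initial queries) by the average m(s)
    observed over many simulated users; fewer steps to satisfaction
    means a better key term. simulation_logs is a list of
    (key_term, steps_to_satisfaction) pairs."""
    total, count = defaultdict(float), defaultdict(int)
    for term, steps in simulation_logs:
        total[term] += steps
        count[term] += 1
    averaged = {t: total[t] / count[t] for t in total}
    return sorted(averaged, key=averaged.get)

# Invented simulation logs: two runs for "iraq", two for "bush", one for "un"
logs = [("iraq", 2), ("iraq", 4), ("bush", 1), ("bush", 1), ("un", 5)]
ranking = rank_key_terms(logs)
```

The full algorithm averages over whole trees of states rather than a single scalar per run, but the ranking principle is the same.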
Evaluations

Performance of individual modules

• NER module
  – Text
    • MET-2 test set
  – Spoken documents
    • 200 broadcast news stories recorded in Sept. 2002 in Taipei
    • "Yahoo! Kimo News Portal" as the external knowledge source
Experiment Cases                        NE    Recall  Precision  F1 score  Overall F1
Text Documents    baseline              PER   94      96         95.0      89.5
                                        LOC   89      93         91.0
                                        ORG   87      96         91.3
                  PAT Tree              PER   95      96         95.5      91.1
                                        LOC   94      92         93.0
                                        ORG   89      95         91.9
Spoken Documents  baseline              PER   71      86         77.8      77.6
                                        LOC   86      91         88.4
                                        ORG   64      95         76.5
                  PAT Tree              PER   73      85         78.5      80.0
                                        LOC   87      91         89.0
                                        ORG   67      95         78.6
                  PAT Tree + external   PER   76      87         81.1      80.9
                  knowledge source      LOC   87      90         88.5
                                        ORG   68      95         79.3
Performance on Broadcast News Retrieval Enhanced by Named Entities

• A total of 1708 distinct NEs recognized from a subset of the broadcast news archives (7000 stories) were used in the LSA or PLSA training
• A total of 350 latent topics were used in either LSA or PLSA
• Incorporating NEs as extra indexing features is helpful, and the improvements achieved by PLSA (row (c)) are more significant than those by LSA (row (b))
Experiment Cases     Precision  Recall  F1 score
(a) Baseline         38.99      50.54   44.02
(b) Baseline+LSA     47.03      59.70   52.61
(c) Baseline+PLSA    48.65      60.44   54.72
Performance on Topic Hierarchy Construction

                                        Correctness (C)  Coverage ratio (P)  Discriminative ratio (d)
(1) All NEs                             0.91             0.97                0.15
(2) PER                                 -                0.66                0.12
(3) ORG                                 -                0.71                0.13
(4) LOC                                 -                0.67                0.17
(5) Terms or phrases by tf·idf scores   N/A              1                   0.35
The Correctness (C) is the number of correctly located NEs divided by the total number of NEs. The Coverage ratio (P) is the percentage of news stories which can be retrieved using the NEs in the topic hierarchy. The Discriminative ratio (d) of an NE with respect to its parent node measures how efficient the NE is in reducing the set of relevant news stories when it is selected as an additional query term. Within the total documents retrieved by the initial query:

  SP = {retrieved documents when the parent node is augmented in query expansion}
  SC = {retrieved documents when the child node is augmented in query expansion}
  d = |SC| / |SP|

[Figure: when |SC| is roughly equal to |SP|, the child term adds little; when |SC| is much smaller, retrieval is much more efficient]
Simulation based on P and d

• Assumption: the user has the ability to choose the correct term to expand his query
• In figures (a) and (b) below, L (the number of news stories retrieved with the initial query) and K (the number of news stories desired by the user) are set as two random variables, where L ranges over [200, 500] and K over [1, L/6]
Simulation based on P and d

(a) d = 0.4 (close to the case of terms and phrases ranked by tf-idf scores)
  - all transactions successful within 5 turns, most within 4

Simulation based on P and d

(b) d = 0.15 (the case of NEs)
  - all transactions successful within 3 turns
  - the smaller value of d made the retrieval more efficient
Performance of the ranking algorithm

• 2000 key terms are automatically selected for an archive of 10000 documents
• 5,000,000 simulated users
• The wpq and tf-idf term ranking methods are compared with the proposed algorithm

• wpq:

    wt = (pt / (1 - pt)) / (qt / (1 - qt))

• tf-idf:

    yi = (0.5 + 0.5 · fil / max_l fil) · log(N / ni)
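Both baseline weights can be sketched directly from the formulas above. The sketch returns the wpq odds ratio itself; applying a log to it, as in the standard relevance-weighting formulation, would not change the term ranking. Parameter names are the ones used in the formulas.

```python
import math

def wpq_weight(p, q):
    """wpq term weight: the odds ratio of the term's occurrence
    probability in relevant documents (p) versus non-relevant
    documents (q), per the formula above."""
    return (p / (1 - p)) / (q / (1 - q))

def robertson_tf_idf(f, f_max, N, n):
    """(0.5 + 0.5·f/f_max)·log(N/n): augmented term frequency scaled by
    inverse document frequency. f is the term's frequency in the
    document, f_max the largest term frequency in that document, N the
    collection size, n the term's document frequency."""
    return (0.5 + 0.5 * f / f_max) * math.log(N / n)
```

A term occurring equally often in relevant and non-relevant documents gets wpq weight 1 (log weight 0), and a term appearing in every document gets tf-idf weight 0, so neither helps discriminate.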
Performance of the ranking algorithm

[Figure: number of trials (0 to 800) versus number of key terms, comparing the proposed ranking algorithm, the wpq method and the tf-idf approach; testing users are also simulated in the same way]

                             Task Success Rate   Average number of key terms needed in successful retrieval
Proposed ranking algorithm   89.2%               2.13
wpq method                   78.6%               3.08
tf-idf approach              33.1%               3.46
Performance of the ranking algorithm

50 real users tested:

Experiments                  Task Success Rate   Average number of key terms needed in successful retrieval
Proposed ranking algorithm   81.2%               2.63
wpq method                   79.8%               3.11
tf-idf method                28.4%               3.54
Conclusion
• Here we presented the concept of using multi-modal dialogues to guide the user in navigating across spoken document archives with a topic hierarchy
• A prototype system has been successfully developed, and a simulation approach was also proposed for performance analysis
Conclusion
• The ranking algorithm, taking query contexts into consideration, performs better than wpq and tf-idf
• Future work:
  – The hierarchy construction and the evaluation metric can be further improved
  – The way we sample the user's information need D and initial query s1 may be refined
  – Actually, all of the components can be improved
Thank you for your attention
Speaker: Yi-Cheng Pan
Advisor: Lin-Shan Lee