TRANSCRIPT
Interactive Navigation across
Spoken Document Archives with Topic Hierarchies
Constructed by Properly Ranked Key Terms
Speaker: Yi-Cheng Pan (Thomas)
Advisor: Lin-Shan Lee
Introduction
• In the past decade, Google successfully attracted people's attention to Information Retrieval (IR)
  – Huge information need
  – Explosive information supply
• Many other possibilities for IR are now considered
  – Implicit/explicit relevance feedback
  – Clustering of retrieved documents
Introduction
• The most attractive network content will be multimedia
  – As a result, the spoken documents associated with the multimedia content can very possibly serve as the key for retrieval and browsing
• Very substantial research efforts have been made in recent years, and very successful techniques and systems have been developed in the area of Spoken Document Retrieval (SDR)
  – Robust features
  – Efficient utilization of the lattice output of ASR
  – Closing the gap between SDR and text IR
Introduction
• Perhaps we should also begin to think about other possibilities for SDR
Introduction
• There is a gap between a Spoken Document Retrieval (SDR) system and the user
  – The user knows what he needs, but not how to translate it into a good query
  – The system knows exactly which query leads to which set of documents, but needs a good mechanism to probe the user's needs
• A large number of documents, including many irrelevant ones, are retrieved for a given query
Introduction
• Subword-based features used for SDR
  – To bypass the OOV problem
  – To constrain the size of the indexing features
  – Noisy documents retrieved by similar subword-based features, e.g., "to recognize speech" vs. "to wreck a nice beach"
• Retrieved spoken documents are difficult to browse
  – Difficult to show on the screen
Introduction
• Here, we propose to use multi-modal dialogues to help the user navigate across spoken document archives and retrieve the desired documents, based on a topic hierarchy constructed from the key terms extracted from the retrieved spoken documents
  – The key terms in the topic hierarchy are properly ranked

[Figure: the user issues a text/spoken query to the retrieval system over the spoken document archive, and clicks on the nodes of the returned topic hierarchy]
The overall system chart

[System chart: Broadcast News Archives feed Key Term Recognition, Broadcast News Retrieval, Topic Hierarchy Construction (driven by the Ranking Algorithm), and Automatic Generation of Titles and Summaries; the Discourse and Dialogue Manager takes user input (spoken queries or clicks), maintains the Topic Hierarchy, and returns titles, summaries and complete documents. Together these form the Multi-modal Dialogue for Information Navigation and Retrieval]
In the following slides, Named Entities and Broadcast News are taken as examples for the key terms and spoken document archives, respectively.
[The same system chart, with Key Term Recognition instantiated as Named Entity Recognition]
NAMED ENTITY (NE) RECOGNITION FROM BROADCAST NEWS

• Named Entities
  – PER, LOC, ORG
  – Key to the content of the news
  – Many heuristic rules and carefully designed algorithms are available to recognize named entities from spoken documents
• Moreover, Named Entities can serve as one kind of semantic feature for SDR
Named Entity Recognition module
• In this module, two special approaches are proposed
  – One for text documents
    • Incorporates an efficient data structure, the PAT tree
  – One for spoken documents
    • Efficiently utilizes the lattice information and the first special approach
Special approach for NER in text documents
• Traditional NER approaches use the well-known local (internal/external) evidences
• Here we propose to further use global evidences
• The basic idea
  – "Even Language Technologies can't handle such problems"
    • "Even Language Technologies" vs. "Language Technologies"
  – 記者陳永在金門報導 ("Reporter Chen Yong [is] in Kinmen reporting")
    • 陳永 vs. 陳永在 (is the name Chen Yong or Chen Yong-zai?)
Special approach for NER in text documents
• If we can efficiently count, for all substrings in our document collection, their frequencies and their likelihoods of being a word (considering their context)
  – we can incorporate such global evidences with the two traditional local evidences into a better strategy
Efficient access and storage of global evidences: PAT Tree

[Figure: an example PAT tree. Branch nodes store a comparison-bit index CB (here CB = 1, 2, 2, 3, 4); leaf nodes store the keys A:0010, B:0011, C:0110, D:1000, E:1011, F:1110, hanging below the root]
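Since the PAT tree serves here as a compressed store of substring statistics, the statistic it provides can be illustrated with a naive counter. This is a sketch only: the corpus, the function name and the `max_len` cutoff are invented for the example, and a real PAT tree avoids materializing every substring explicitly.

```python
from collections import Counter

def substring_counts(docs, max_len=4):
    """Count every substring of up to max_len characters across a corpus.
    A PAT (Patricia) tree provides the same frequencies in compressed
    form; this naive Counter only illustrates the statistic it stores."""
    counts = Counter()
    for doc in docs:
        n = len(doc)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                counts[doc[i:j]] += 1
    return counts

# Toy corpus standing in for the document collection
corpus = ["abcab", "bcab"]
counts = substring_counts(corpus)
```

Frequent substrings with consistent contexts are then candidates for the global word evidence described above.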
Special approach for NER in spoken documents

• Many Named Entities are out-of-vocabulary (OOV) words and cannot be obtained from the ASR output
• We propose to recover these OOV words using the Internet as an external knowledge source
[Flow chart: word arcs from the word graph are AND-ed into queries to the Internet; the relevant text documents returned are indexed with PAT trees to produce possible NEs, which are matched with the low-confidence arcs to form the NE candidate lattice]
Special approach for NER in spoken documents

• Possible NE generation:
  – Word arcs in the word graph with higher confidence measures and higher tf-idf scores are AND-ed as queries to retrieve a set of relevant text documents from the Internet
  – NER is then performed on this set of documents, and the recognized Named Entities (NEs) are pooled as NE candidates
• NE matching with the lattice:
  – Arcs with low confidence scores, and therefore possibly containing OOVs, are matched with the possible NEs. If the similarity is higher than a threshold, the NE is put into the NE candidate lattice
Special approach for NER in spoken documents

• NE matching with the lattice:
  – In order to perform the matching between two phone sequences, we defined a phone similarity matrix, based on both the acoustic distance (from the Mahalanobis distance between the HMM models of the phones) and the pronunciation distance (from the probability that a phone is likely to be pronounced as another phone in the pronunciation model)
• In the generated NE candidate lattice, we perform the baseline NE verification/classification procedure
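A minimal sketch of how such a phone similarity matrix could drive the matching, using a standard global-alignment dynamic program. The gap penalty and the toy two-phone matrix are assumptions for illustration, not the system's actual values.

```python
def sequence_similarity(seq_a, seq_b, phone_sim, gap=-1.0):
    """Global alignment score between two phone sequences using a
    phone-to-phone similarity matrix (dict of dicts). gap is the
    insertion/deletion penalty; matrix values are assumed in [0, 1]."""
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = best score aligning seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + phone_sim[seq_a[i - 1]][seq_b[j - 1]]
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]

# Toy two-phone similarity matrix (invented values)
sim = {"a": {"a": 1.0, "b": 0.3}, "b": {"a": 0.3, "b": 1.0}}
```

An NE candidate whose alignment score against a low-confidence arc exceeds the threshold would then enter the NE candidate lattice.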
Special approach for NER in spoken documents

• The baseline NE verification/classification procedure
  – Given a sequence of Chinese characters S = s1, ..., sn, there exist many possible word sequences W = w1, ..., wm (each word is composed of one to several characters) with corresponding class sequences C = c1, ..., cm (the classes here are PER, LOC, ORG and all other words in the vocabulary). The purpose here is to find the best word sequence W = w1*, ..., wm* and its corresponding class sequence C = c1*, ..., cm* that maximize P(C, W) as shown below:

    (C*, W*) = argmax_{C,W} P(C, W) = argmax_{C,W} P(W|C) · P(C)
Special approach for NER in spoken documents

• The baseline NE verification/classification procedure

    P(C) ≈ P(c1) · P(c2|c1) · ∏_{i=3}^{m} P(ci | c_{i-2} c_{i-1})

    P(W|C) = P(w1 ... wm | c1 ... cm) ≈ ∏_{i=1}^{m} P(wi | ci)
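The maximization can be sketched by scoring each candidate (W, C) pair in log space under these approximations and taking the argmax over an enumerated hypothesis list. All probability tables and hypotheses below are toy values invented for illustration.

```python
import math

def log_score(words, classes, pw_c, pc1, pc2, pc3):
    """log [P(C)·P(W|C)] under the approximations above.
    pc1[c] = P(c1), pc2[(c1, c2)] = P(c2|c1),
    pc3[(c1, c2, c3)] = P(c3|c1 c2), pw_c[(w, c)] = P(w|c)."""
    s = math.log(pc1[classes[0]])
    if len(classes) > 1:
        s += math.log(pc2[(classes[0], classes[1])])
    for i in range(2, len(classes)):
        s += math.log(pc3[(classes[i - 2], classes[i - 1], classes[i])])
    for w, c in zip(words, classes):
        s += math.log(pw_c[(w, c)])
    return s

def best_hypothesis(hypotheses, pw_c, pc1, pc2, pc3):
    """argmax over candidate (W, C) pairs, as in the formula above."""
    return max(hypotheses,
               key=lambda wc: log_score(wc[0], wc[1], pw_c, pc1, pc2, pc3))

# Toy probability tables (all values invented for illustration)
pc1 = {"PER": 0.2, "O": 0.8}
pc2 = {("PER", "O"): 0.9, ("O", "O"): 0.6}
pc3 = {}
pw_c = {("chen-yong", "PER"): 0.5, ("chen-yong", "O"): 0.01,
        ("reports", "O"): 0.3}
hypotheses = [(["chen-yong", "reports"], ["PER", "O"]),
              (["chen-yong", "reports"], ["O", "O"])]
best = best_hypothesis(hypotheses, pw_c, pc1, pc2, pc3)
```

A real system would search the lattice with dynamic programming rather than enumerate all hypotheses.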
Broadcast News retrieval enhanced by Named Entities

• The NEs recognized from broadcast news are apparently extra indexing features for SDR
• We can generate the NE-document matrix off-line and utilize this matrix in two similar ways:
  – LSA
    • The input query vector is folded in as a pseudo-document in the latent semantic space and compared by cosine similarity with the documents in the same space
  – PLSA
    • P(q|d) is computed based on

      P(q|d) ≈ ∏_i P(ti|d) = ∏_i Σ_k P(ti|Tk) · P(Tk|d)
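The PLSA query likelihood above can be sketched as follows, with toy parameter tables standing in for values a real system would estimate with EM.

```python
def plsa_query_likelihood(query_terms, doc, p_term_given_topic, p_topic_given_doc):
    """P(q|d) ≈ product over query terms of the sum over latent topics
    of P(t|Tk)·P(Tk|d), following the formula above."""
    score = 1.0
    for t in query_terms:
        score *= sum(p_term_given_topic[k].get(t, 0.0) * p_k
                     for k, p_k in p_topic_given_doc[doc].items())
    return score

# Toy PLSA parameters over two latent topics (invented values)
p_t_T = {0: {"iraq": 0.5, "bush": 0.5}, 1: {"taipei": 1.0}}
p_T_d = {"d1": {0: 0.8, 1: 0.2}}
```

Documents are then ranked by this likelihood for the given query.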
Broadcast News retrieval enhanced by Named Entities

• The above two approaches were integrated with a baseline broadcast news retrieval system based on Mandarin syllable-level indexing terms with a vector space model
• For each news story, the baseline system and the LSA and PLSA approaches respectively produce a score for the given query q
• The weighted sum of these scores is then used to select the retrieved news stories
Topic hierarchy construction from the broadcast news

• The hierarchical organization of retrieved text documents to help the user browse through the relevant documents has been well studied
• The extension to spoken documents is not straightforward because of the many recognition errors in the transcriptions
• Here we propose to use the relatively reliable Named Entities recognized with the special approaches above to construct the topic hierarchy
Topic hierarchy construction from the broadcast news

• There are some important reasons to choose NEs rather than other terms or phrases to play this role
  – NEs provide high coverage of the broadcast news (almost all news stories contain Named Entities)
  – NEs have high discriminative ability (i.e., they easily separate news stories addressing different topics) and thus are very useful augmented query terms
• The Hierarchical Agglomerative Clustering and Partitioning algorithm (HAC+P), recently proposed for text documents (Chuang and Chien, 2005), is used here, but performed on the NEs recognized from broadcast news
The HAC+P algorithm

• This algorithm consists of two phases:
  – HAC-based clustering to construct a binary-tree hierarchy
  – A partitioning (P) algorithm to transform the binary-tree hierarchy into a balanced and comprehensive m-ary hierarchy, considering two factors:
    • to have good cluster quality
    • to have a reasonable branching number at each node
The HAC+P algorithm

• HAC-based clustering
  – For each involved NE or key term t, we build a feature vector vt by averaging the vector representations of all news stories including t
  – Bottom-up clustering by average linkage; the similarity measure between two clusters Ci and Cj is defined as

    S(Ci, Cj) = (1 / (|Ci||Cj|)) · Σ_{vt∈Ci} Σ_{vs∈Cj} cosine(vt, vs)
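The average-linkage similarity above can be sketched directly; vectors here are plain Python lists rather than the system's actual story representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_link_similarity(ci, cj):
    """S(Ci, Cj): average pairwise cosine between the term vectors of
    two clusters, matching the average-linkage formula above."""
    return sum(cosine(u, v) for u in ci for v in cj) / (len(ci) * len(cj))
```

At each HAC step the two clusters with the highest S(Ci, Cj) are merged.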
The HAC+P algorithm

• The partitioning (P) algorithm
  – Top-down fashion
  – Select the best cuts iteratively
[Figure: a binary-tree hierarchy with leaves C1..C5 and internal nodes C6..C9; (a) candidate cut levels l = 1..4 across the tree, and (b) the flattened hierarchy after a cut is applied]
The partitioning (P) algorithm

• Select the best cuts
• Two factors are considered
  – Cluster quality
    • Each cut contributes a set of clusters H, and we may calculate the quality of H as

      Q(H) = (1/|H|) · Σ_{Ci∈H} S(Ci, C̄i) / S(Ci, Ci),  where C̄i = ∪_{k≠i} Ck is the complement of Ci

  – Number preference

      f(m) = (1 / (α! β^α)) · m^(α-1) · e^(-m/β),  where m is the size of H
The partitioning (P) algorithm

• The final score combines the two

    η = Q(H) · f(m)
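Assuming the final score η = Q(H)·f(m), the size-preference term and the combination can be sketched as below. Q(H) is passed in as a precomputed number, and the α, β defaults are illustrative only, not the thesis's tuned values.

```python
import math

def size_preference(m, alpha=3, beta=2.0):
    """f(m) = m^(alpha-1)·e^(-m/beta) / (alpha!·beta^alpha): a
    Gamma-shaped preference over the branching number m, favoring
    moderate branching over degenerate (1) or very wide cuts."""
    return (m ** (alpha - 1)) * math.exp(-m / beta) / (
        math.factorial(alpha) * beta ** alpha)

def cut_score(quality, m, alpha=3, beta=2.0):
    """η = Q(H)·f(m): combine cluster quality with the size preference;
    the partitioning step compares η across candidate cuts."""
    return quality * size_preference(m, alpha, beta)
```

With these defaults f(m) peaks near m = 4, so a cut producing about four children is preferred when cluster quality is equal.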
The HAC+P algorithm

• Cluster naming
  – Choose the non-duplicate NE with the highest tf-idf score as the name
[Example topic hierarchy over NEs such as 布希 (George Bush), 白宮 (White House), 鮑威爾 (Powell), 聯合國 (United Nations), 伊拉克 (Iraq), 以色列 (Israel), 巴勒斯坦 (Palestine), ...]
Dialogue Manager
• The purpose of the dialogue manager is to enhance the task success rate and task efficiency
• The interaction between the user and system is based on the topic hierarchy tree
• Given the user’s query context, the dialogue manager should propose smart hierarchy trees during the task
Dialogue Manager
• We propose a ranking algorithm that learns possible user behaviors based on reinforcement learning and ranks the topic hierarchies properly
Reinforcement Learning and Markov Decision Process (MDP)

• An MDP can be described as a tuple <S, A, T, R>, where
  – S is a finite set of states
    • We define a state as the currently retrieved documents under the current sequence of queries
  – A is a finite set of actions
    • We define the possible actions as the different key terms proposed to the user in the topic hierarchy for further clarification
Key Term Space and Document Space

[Figure: key terms ti, tj, tk, tl in the key term space map to document sets C(ti), C(tj), C(tk) in the archive space; states s1 = [ti], s2 = [ti, tj], s3 = [ti, tk], sn = [ti, tj, tl] correspond to document groups G1 = C(ti), G2 = C(ti) ∪ C(tj), G3 = C(ti) ∪ C(tk), Gn = C(ti) ∪ C(tj) ∪ C(tl); the mapping is defined by some IR function (e.g., PLSA)]
• A tree of states is generated for each user given the initial query s1
• The user is assumed satisfied (represented with double circles) when the recall rate L/|D| > τ0
  – L: number of relevant documents appearing in the top K retrieved documents
  – D: desired document set; τ0: a predefined threshold
• m(s): minimum number of steps or queries to arrive at a final state

State Transition Diagram

[Figure: a state transition tree rooted at s1 with states s2 through s15; example scores m(s4) = 2, m(s7) = 3, m(s9) = 3, m(s12) = 4, m(s13) = 4, m(s15) = 5]
Simulated Users

• The key terms are ranked using a large number of simulated users
• Each simulated user is characterized by a desired document set D and an initial query s1
  – D is automatically generated with a randomly determined size, by clustering the document archive and randomly selecting key terms covering small subsets of the clusters
• Each simulated user produces a tree of states
• Each state on the tree is scored by the minimum number of steps or queries needed to arrive at a final state (user satisfied) when going through that state
Ranking Algorithm

• With the same initial query, the basic tree structure is fixed for all simulated users
  – Different simulated users with different desired document sets D truncate the same tree at different levels
• With a huge number of simulated users, an averaged tree is trained for each initial query (which is a key term)
• The averaged scores of the states of the averaged tree give the ranking of the key terms
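The averaging step can be sketched as follows, under the simplifying assumption that each simulated run is summarized by a pair (initial key term, steps until the user was satisfied); the log data below is invented for illustration.

```python
from collections import defaultdict

def rank_key_terms(simulation_logs):
    """Rank key terms (used as initial queries) by the average m(s)
    observed over many simulated users; fewer steps to satisfaction
    means a better key term. simulation_logs is a list of
    (key_term, steps_to_satisfaction) pairs."""
    total, count = defaultdict(float), defaultdict(int)
    for term, steps in simulation_logs:
        total[term] += steps
        count[term] += 1
    averaged = {t: total[t] / count[t] for t in total}
    return sorted(averaged, key=averaged.get)

# Invented simulation logs: two runs for "iraq", two for "bush", one for "un"
logs = [("iraq", 2), ("iraq", 4), ("bush", 1), ("bush", 1), ("un", 5)]
ranking = rank_key_terms(logs)
```

The full algorithm averages over whole trees of states rather than a single scalar per run, but the ranking principle is the same.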
Evaluations

Performance of individual modules

• NER module
  – Text
    • MET-2 test set
  – Spoken documents
    • 200 broadcast news stories recorded in Sept. 2002 in Taipei
    • "Yahoo! Kimo News Portal" as the external knowledge source
Experiment Cases                        NE    Recall  Precision  F1 score  Overall F1
Text Documents    baseline              PER   94      96         95.0      89.5
                                        LOC   89      93         91.0
                                        ORG   87      96         91.3
                  PAT Tree              PER   95      96         95.5      91.1
                                        LOC   94      92         93.0
                                        ORG   89      95         91.9
Spoken Documents  baseline              PER   71      86         77.8      77.6
                                        LOC   86      91         88.4
                                        ORG   64      95         76.5
                  PAT Tree              PER   73      85         78.5      80.0
                                        LOC   87      91         89.0
                                        ORG   67      95         78.6
                  PAT Tree + external   PER   76      87         81.1      80.9
                  knowledge source      LOC   87      90         88.5
                                        ORG   68      95         79.3
Performance on Broadcast News Retrieval Enhanced by Named Entities

• A total of 1708 distinct NEs recognized from a subset of the broadcast news archives (7000 stories) were used in the LSA or PLSA training
• A total of 350 latent topics were used in either LSA or PLSA
• Incorporating NEs as extra indexing features is helpful, and the improvements achieved by PLSA (row (c)) are more significant than those by LSA (row (b))
Experiment Cases     Precision  Recall  F1 score
(a) Baseline         38.99      50.54   44.02
(b) Baseline+LSA     47.03      59.70   52.61
(c) Baseline+PLSA    48.65      60.44   54.72
Performance on Topic Hierarchy Construction

                                        Correctness (C)  Coverage ratio (P)  Discriminative ratio (d)
(1) All NEs                             0.91             0.97                0.15
(2) PER                                 -                0.66                0.12
(3) ORG                                 -                0.71                0.13
(4) LOC                                 -                0.67                0.17
(5) Terms or phrases by tf·idf scores   N/A              1                   0.35
The Correctness (C) is the number of correctly located NEs divided by the total number of NEs. The Coverage ratio (P) is the percentage of news stories which can be retrieved using the NEs in the topic hierarchy. The Discriminative ratio (d) of an NE with respect to its parent node measures how efficient the NE is in reducing the set of relevant news stories when it is selected as an additional query term. Within the total documents retrieved by the initial query:

  SP = {retrieved documents when the parent node is augmented in query expansion}
  SC = {retrieved documents when the child node is augmented in query expansion}
  d = |SC| / |SP|

[Figure: when |SC| is roughly equal to |SP|, the child term adds little; when |SC| is much smaller, retrieval is much more efficient]
Simulation based on P and d

• Assumption: the user has the ability to choose the correct term to expand his query
• In figures (a) and (b) below, L (the number of news stories retrieved with the initial query) and K (the number of news stories desired by the user) are set as two random variables, where L ranges over [200, 500] and K over [1, L/6]
Simulation based on P and d

(a) d = 0.4 (close to the case of terms and phrases ranked by tf-idf scores)
  - all transactions successful within 5 turns, most within 4

Simulation based on P and d

(b) d = 0.15 (the case of NEs)
  - all transactions successful within 3 turns
  - the smaller value of d made the retrieval more efficient
Performance of the ranking algorithm

• 2000 key terms are automatically selected for an archive of 10000 documents
• 5,000,000 simulated users
• The wpq and tf-idf term ranking methods are compared with the proposed algorithm

• wpq:

    wt = (pt / (1 - pt)) / (qt / (1 - qt))

• tf-idf:

    yi = (0.5 + 0.5 · fil / max_l fil) · log(N / ni)
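Both baseline weights can be sketched directly from the formulas above. The sketch returns the wpq odds ratio itself; applying a log to it, as in the standard relevance-weighting formulation, would not change the term ranking. Parameter names are the ones used in the formulas.

```python
import math

def wpq_weight(p, q):
    """wpq term weight: the odds ratio of the term's occurrence
    probability in relevant documents (p) versus non-relevant
    documents (q), per the formula above."""
    return (p / (1 - p)) / (q / (1 - q))

def robertson_tf_idf(f, f_max, N, n):
    """(0.5 + 0.5·f/f_max)·log(N/n): augmented term frequency scaled by
    inverse document frequency. f is the term's frequency in the
    document, f_max the largest term frequency in that document, N the
    collection size, n the term's document frequency."""
    return (0.5 + 0.5 * f / f_max) * math.log(N / n)
```

A term occurring equally often in relevant and non-relevant documents gets wpq weight 1 (log weight 0), and a term appearing in every document gets tf-idf weight 0, so neither helps discriminate.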
Performance of the ranking algorithm

[Figure: number of trials (0 to 800) versus number of key terms, comparing the proposed ranking algorithm, the wpq method and the tf-idf approach; testing users are also simulated in the same way]

                             Task Success Rate   Average number of key terms needed in successful retrieval
Proposed ranking algorithm   89.2%               2.13
wpq method                   78.6%               3.08
tf-idf approach              33.1%               3.46
Performance of the ranking algorithm

50 real users tested:

Experiments                  Task Success Rate   Average number of key terms needed in successful retrieval
Proposed ranking algorithm   81.2%               2.63
wpq method                   79.8%               3.11
tf-idf method                28.4%               3.54
Conclusion
• Here we presented the concept of using multi-modal dialogues to guide the user in navigating across spoken document archives with a topic hierarchy
• A prototype system has been successfully developed, and a simulation approach was also proposed for performance analysis
Conclusion
• The ranking algorithm, taking query contexts into consideration, performs better than wpq and tf-idf
• Future work:
  – The hierarchy construction and the evaluation metric can be further improved
  – The way we sample the user's information need D and initial query s1 may be refined
  – Actually, all of the components can be improved
Thank you for your attention
Speaker: Yi-Cheng Pan
Advisor: Lin-Shan Lee