TRANSCRIPT
Results of the 2000 Topic Detection and Tracking Evaluation
in Mandarin and English
Jonathan Fiscus and George Doddington
What’s New in TDT 2000
• TDT3 corpus used in both 1999 and 2000
– 120 topics used in the 2000 test: 60 1999 topics, 60 new topics
• Of 44K news stories, 24% were at least singly judged YES or BRIEF
• The 1999 and 2000 topics are very different in terms of size and cross-language makeup
– Annotation of new topics used search-engine-guided annotation: "Use a search engine with on-topic story feedback and interactive searching techniques to limit the number of stories read by annotators"
• Evaluation Protocol Changes
– Only minor changes to Tracking (negative example stories)
– Link Detection test set selection changed in light of last year's experience
[Bar chart: count of on-topic stories (0–10,000) for the TDT3 Topics, 1999 Topic Set, and 2000 Topic Set, split by Mandarin and English]
Search-Guided Annotation: How will it affect scores?
• Simulate search-guided annotation using 1999 topics, 1999 annotations and 1999 Systems
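The simulation described above could be sketched roughly as follows. This is purely illustrative: the scoring function, the feedback interval, and the reading budget are all assumptions, not the actual 1999 tooling.

```python
def simulate_search_guided_annotation(stories, is_on_topic, score,
                                      feedback_every=10, budget=200):
    """Hypothetical simulation of search-guided annotation.

    stories      : list of story ids
    is_on_topic  : ground-truth labels (e.g. from the exhaustive 1999
                   annotations), mapping story id -> bool
    score        : relevance function score(story, positives) given the
                   on-topic feedback collected so far
    All names and defaults here are illustrative assumptions.
    """
    positives, read = [], []
    remaining = list(stories)
    while remaining and len(read) < budget:
        # Re-rank the unread pool, then read a batch before the next
        # round of on-topic feedback (FB=10 in the slides).
        remaining.sort(key=lambda s: score(s, positives), reverse=True)
        batch, remaining = remaining[:feedback_every], remaining[feedback_every:]
        for s in batch:
            read.append(s)
            if is_on_topic[s]:
                positives.append(s)
    on = sum(1 for s in read if is_on_topic[s])
    off = len(read) - on
    rof = on / max(off, 1)               # "#Read On-topic / #Read Off-topic"
    p_judged = len(read) / len(stories)  # fraction of judged stories read
    return rof, p_judged
```

Running this over the 1999 topics with the 1999 system scores would yield the ROF and Pjudged quantities plotted on the next slide.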
Effectiveness of Search-Guided Annotation
[Log-log plot vs. ROF (#Read On-topic/#Read Off-topic, 0.1–10): SG (FB=10) Pjudged, the probability of a human reading a judged story, and SG (FB=10) #Reads, the number of read stories (0.001–1000), showing stability in the "region of interest"]
TDT Topic Tracking Task
[Diagram: a chronological story stream with on-topic training stories followed by unknown test stories]
7 Participants: Dragon, IBM, Texas A&M Univ., TNO, Univ. of Iowa, Univ. of Massachusetts, Univ. of Maryland
System Goal:
– To detect stories that discuss the target topic, in multiple source streams
• Supervised Training
– Given Nt sample stories that discuss a given target topic
• Testing
– Find all subsequent stories that discuss the target topic
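For reference, TDT tracking is scored with a normalized detection cost. A minimal sketch, assuming the commonly cited TDT constants (C_miss = 1.0, C_fa = 0.1, P_topic = 0.02); treat the defaults as assumptions:

```python
def normalized_tracking_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_topic=0.02):
    """Normalized detection cost used to score TDT tracking.

    C = (C_miss * P_miss * P_topic + C_fa * P_fa * (1 - P_topic))
        / min(C_miss * P_topic, C_fa * (1 - P_topic))

    The default constants are the commonly cited TDT settings and are
    assumptions here, not quoted from these slides.
    """
    cost = c_miss * p_miss * p_topic + c_fa * p_fa * (1 - p_topic)
    return cost / min(c_miss * p_topic, c_fa * (1 - p_topic))
```

Under these constants a perfect system scores 0, while a system that misses everything and raises no false alarms scores 1.0, which is why the result charts use 1 as the top of the scale.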
Topic Tracking Results
[Bar charts: normalized tracking cost (0.1–1) per system, comparing English translation vs. native orthography, and comparing training conditions (English translation, Nt=4, Nn=0; native orthography, Nt=4, Nn=2; native orthography, Nt=4, Nn=0), with and without negative training examples]
Basic Condition: Newswire + BNews, reference story boundaries, English training: 1 On-topic
Challenge Condition: Newswire + BNews ASR, automatic story boundaries, English training: 4 On-topic, 2 Negative training
Topic Tracking Results (Expanded Basic Condition DET Curve)
Topic Tracking Results (Expanded Challenge Condition DET Curve)
[DET curves comparing system performance with and without negative training examples]
Effect of Automatic Story Boundaries
• Evaluation conditioned jointly by source and language
– Newswire, Broadcast News, English and Mandarin
Test Condition: NWT+BNasr, 4 English training stories, reference boundaries
[Bar chart, log scale 0.010–1.000: normalized tracking cost for IBM1 (0.093) and UMass1 (0.119), showing the degradation due to automatic story boundaries for ASR sources]
[Scatter plot, log-log axes 0.01–1: normalized tracking cost with 2000 topic training vs. 1999 topic training]
Variability of Tracking Performance Based on Training Stories
• BBN ran their 1999 system on this year's index files:
– Same topics, but different training stories
– One caveat: these results are based on different "test epochs"; the 2000 index files contain more stories
• There could be several reasons for the difference
• …needs future investigation
NIST Speech Group
TDT Link Detection Task
One Participant: University of Massachusetts
System Goal:
– To detect whether a pair of stories discuss the same topic
(Can be thought of as a "primitive operator" to build a variety of applications)
2000 Link Detection Results
• A lot was learned last year:
– The test set must be properly sampled
• "Linked" story pairs were selected by randomly sampling all possible on-topic story pairs
• "Unlinked" pairs were selected using each on-topic story as one of the pair, with a randomly chosen story as the second
– This year, the task was made multilingual
– More story pairs were used
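The pair-sampling scheme described above might be sketched as follows. All names here are illustrative assumptions; this is not the actual evaluation tooling.

```python
import itertools
import random

def build_link_test_set(topics, stories, n_linked, n_unlinked, seed=0):
    """Sketch of the linked/unlinked pair sampling described above.

    topics  : dict mapping topic id -> set of on-topic story ids
    stories : list of all story ids in the corpus
    """
    rng = random.Random(seed)
    # Linked pairs: randomly sample from all possible on-topic story pairs.
    all_linked = [pair for members in topics.values()
                  for pair in itertools.combinations(sorted(members), 2)]
    linked = rng.sample(all_linked, min(n_linked, len(all_linked)))
    # Unlinked pairs: an on-topic story as the first element, with a
    # randomly chosen story (not on the same topic) as the second.
    on_topic = [(t, s) for t, members in topics.items() for s in members]
    unlinked = []
    while len(unlinked) < n_unlinked:
        t, s = rng.choice(on_topic)
        other = rng.choice(stories)
        if other != s and other not in topics[t]:
            unlinked.append((s, other))
    return linked, unlinked
```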
Link Detection Test Set Composition
[Bar chart: number of story pairs (0–70,000), linked vs. unlinked, for Overall, Eng-to-Eng, Eng-to-Man, and Man-to-Man]
Link Detection Results
• Required Condition
– Multilingual texts
– Newswire + Broadcast News ASR
– Reference story boundaries
– 10-file decision deferral
[Bar chart, log scale 0.10–1.00: normalized link detection cost for UMass1: Overall 0.31, Eng-to-Eng 0.17, Man-to-Man 0.33, Eng-to-Man 0.41]
TDT Topic Detection Task
Three Participants: Chinese Univ. of Hong Kong, Dragon, Univ. of Massachusetts
System Goal:
– To detect topics in terms of the (clusters of) stories that discuss them
• "Unsupervised" topic training
• New topics must be detected as the incoming stories are processed
• Input stories are then associated with one of the topics
• Required Condition (in yellow)
– Multilingual Topic Detection
– Newswire + Broadcast News ASR
– Automatic Story Boundaries
• Performance on the 1999 and 2000 topic sets differs
2000 Topic Detection Evaluation
[Bar charts, log scale 0.1–1: normalized detection cost for CUHK1, Dragon1, and UMass1 on the 1999 topics, 2000 topics, and combined 1999+2000 topics; one chart using English translations for Mandarin, one using native orthography]
Effect of Topic Size on Detection Performance
• The 1999 topics have more on-topic stories than the 2000 topics
• The distribution of scores is related to topic size
– Bigger topics tend to have higher scores
– Is this a behavior induced by setting a topic-size parameter in training?
[Scatter plot, log-log: normalized detection cost (0.0001–1) vs. number of on-topic stories (1–1000), for 1999 and 2000 topics. Dragon1 results: NWT+BNasr, reference boundaries, multilingual texts]
Fractional Components of Detection Cost
• Evaluations conditioned on factors (like language) are problematic
• Instead, compute the additive contributions to detection costs for different subsets of data.
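The additive decomposition proposed above could be sketched like this; the function and field names are illustrative assumptions, not the evaluation software's actual interface.

```python
def fractional_costs(errors, c_miss=1.0, c_fa=0.1, p_topic=0.02):
    """Additive per-subset contributions to the normalized detection cost.

    errors maps a subset name (e.g. "English") to that subset's miss and
    false-alarm counts, along with the *global* totals of on-topic and
    off-topic trials. Because each error falls in exactly one subset, the
    fractional costs sum to the overall normalized cost. The cost
    constants are assumed, not quoted from these slides.
    """
    norm = min(c_miss * p_topic, c_fa * (1 - p_topic))
    out = {}
    for name, e in errors.items():
        p_miss = e["misses"] / e["total_on_topic"]        # subset's misses
        p_fa = e["false_alarms"] / e["total_off_topic"]   # subset's false alarms
        out[name] = (c_miss * p_miss * p_topic
                     + c_fa * p_fa * (1 - p_topic)) / norm
    return out
```

The useful property is exactly the one the slide argues for: the English and Mandarin components add up to the overall cost, so each language's share is directly interpretable, unlike separately conditioned evaluations.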
[Charts: additive fractional components of normalized detection cost (0–0.03) for English and Mandarin, and a log-log scatter (1e-05 to 0.1) of the fractional cost of Mandarin errors vs. the fractional cost of English errors for the 1999 and 2000 topics, showing an interesting reversal. Dragon1 results: NWT+BNasr, reference boundaries, multilingual texts]
Effects of Automatic Boundaries on Detection Performance
– Multilingual Topic Detection
– Newswire + Broadcast News ASR
– Reference vs. Automatic Story Boundaries
[Bar chart, log scale 0.1–1: normalized detection cost for Dragon1, Dragon2, and UMass1 with reference vs. automatic story boundaries: 19%, 21%, and 41% relative increase in cost, respectively]
TDT Segmentation Task
[Diagram: a transcription of text (words), segmented into story and non-story regions]
One Participant: MITRE
(For TDT 2000, story segmentation is an integral part of the other tasks, not just a separate evaluation task)
System Goal:
– To segment the source stream into its constituent stories, for all audio sources
Story Segmentation Results
• Required Condition:
– Broadcast News ASR
MITRE Segmentation Results
[Bar chart, log scale 0.1–1.0: normalized segmentation cost for the 1999 and 2000 systems, for English (10KW deferral) and Mandarin (15KC deferral)]
TDT First Story Detection (FSD) Task
Two Participants: National Taiwan University and University of Massachusetts
System Goal:
– To detect the first story that discusses a topic, for all topics
• Evaluating "part" of a Topic Detection system (i.e., when to start a new cluster)
[Diagram of a story stream for two topics, marking the first story on each topic versus later, non-first stories]
UMass Historical Performance Using Reference Story Boundaries
[Bar chart, log scale 0.10–10.00: normalized FSD cost for UMass: 0.76 for the 1999 system, 0.64 for the 2000 system]
First Story Detection Results
• Required Condition:
– English Newswire and Broadcast News ASR transcripts
– Automatic story boundaries
– One-file decision deferral
[Bar chart, log scale 0.1–10: normalized FSD cost for NTU1 and UMass1 on NWT+BNasr with reference boundaries, NWT+BNasr with automatic boundaries (the required condition), and NWT+BNman with reference boundaries]
1999 and 2000 Topic Set Differences in FSD Evaluation
For UMass there is a slight difference, but a marked difference for the NTU system
Summary
• Many, many things remain to look at
– Results appear to be a function of topic size and topic set in the detection task, but it's unclear why
• The re-usability of last year's detection system outputs enables valuable studies
• Conditioned detection evaluation should be replaced with a "contribution to cost" model
– Performance variability with respect to tracking training stories should be further investigated
– …and the list goes on
• When should the annotations be released?
• Need to find a cost-effective annotation technique
– Consider TREC ad-hoc style annotation via simulation