New Event Detection at UMass Amherst
Giridhar Kumaran and James Allan
CIIR, UMass Amherst
Preprocessing
- Lemur Toolkit for tokenization, stopping, k-stemming (http://www-2.cs.cmu.edu/~lemur/)
- BBN Identifinder™ for extracting named entities
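The preprocessing steps can be illustrated with a minimal Python sketch. It does not call Lemur or Identifinder: the stopword list is a tiny hypothetical stand-in, k-stemming is omitted, and the named-entity terms are supplied by hand where a tagger's output would normally be used. The function names `preprocess` and `split_by_entity` are invented for this illustration.

```python
import re

# Tiny illustrative stopword list (hypothetical; a real toolkit ships its own).
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}

def preprocess(text):
    """Lowercase, tokenize on letters and digits, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def split_by_entity(tokens, entity_terms):
    """Separate tokens into named-entity and non-named-entity lists."""
    ne = [t for t in tokens if t in entity_terms]
    non_ne = [t for t in tokens if t not in entity_terms]
    return ne, non_ne

tokens = preprocess("The crash of a Thai Airways Airbus in Surat Thani")
# The entity set below stands in for the output of a named-entity tagger.
ne, non_ne = split_by_entity(tokens, {"thai", "airways", "surat", "thani"})
print(ne)      # ['thai', 'airways', 'surat', 'thani']
print(non_ne)  # ['crash', 'airbus']
```

Keeping the two term lists separate is what later allows a story pair to be scored on named entities alone, on non-named entities alone, or on all terms.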
Systems fielded
- Submitted four systems
- Didn’t include last year’s system
  - Classification according to LDC categories and term pruning
  - Didn’t work on an exclusively newswire (NW) story corpus
Primary system – UMass1
- Utility of named entities acknowledged
- Failure analysis indicates
  - Large number of old stories have low confidence scores (false alarms)
  - Conflict with new story scores
- Reasons
  - Stories on multiple topics
  - Diffuse topics
  - Varying document lengths
Primary system – UMass1
- Focus
  - Identify old stories better; this affects cost
- Clue
  - Most old stories get low confidence scores because their topics are linked by
    - only named entities (large number)
    - only non-named entities (few)
Primary system – UMass1
- Approach
  - Look at the set of closest matching stories
  - If the named entity or non-named entity match is consistently high, modify the confidence score
Primary system – UMass1
- Procedure
  - Double the original confidence score if it is less than a threshold
  - Gradually reduce the score towards the original score if the set of closest stories match neither named entities nor non-named entities
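One possible reading of this procedure is sketched below in Python. The slides do not give the actual formula, so this is not the UMass1 implementation: the function name, the decision threshold, the per-story step size, and the rule for counting weak neighbours are all hypothetical. The value 0.1 echoes the threshold shown in the TDT3 examples, although the slides do not say exactly which comparison it applies to.

```python
def boosted_confidence(score, neighbours, decision_threshold=0.5,
                       match_threshold=0.1, step=0.2):
    """Boost a low confidence score, backing off when evidence is weak.

    score: original confidence (similarity to the closest earlier story).
    neighbours: (ne_sim, non_ne_sim) pairs for the closest matching stories.
    All default values are hypothetical, chosen only for illustration.
    """
    if score >= decision_threshold:
        return score  # not below the threshold, so leave it unchanged
    multiplier = 2.0  # start by doubling the original score
    for ne_sim, non_ne_sim in neighbours:
        # A neighbour matching on neither kind of term is weak evidence,
        # so move the multiplier one step back towards 1.0.
        if ne_sim < match_threshold and non_ne_sim < match_threshold:
            multiplier -= step
    multiplier = max(multiplier, 1.0)  # never fall below the original score
    return score * multiplier

print(boosted_confidence(0.2, [(0.3, 0.05), (0.4, 0.2)]))
print(boosted_confidence(0.2, [(0.0, 0.0)] * 10))
```

With supporting neighbours the score is doubled; with ten neighbours that match on nothing the multiplier bottoms out at 1.0 and the original score is returned.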
UMass1 – Examples from TDT3
Russian Financial Crisis - Old Story APW19981020.0237
Closest story       AllSim   NESim   noNESim
APW19981015.0139    0.278    0.273   0.270
APW19981009.0790    0.251    0.366   0.178
APW19981016.0669    0.237    0.423   0.166
APW19981006.0509    0.211    0.359   0.107
APW19981013.0582    0.206    0.395   0.056
APW19981006.0229    0.196    0.510   0.047
Threshold = 0.1
UMass1 – Examples from TDT3
Russian Financial Crisis - Old Story APW19981020.0237
Closest story       AllSim      NESim   noNESim
APW19981015.0139    0.278*1.6   0.273   0.270
APW19981009.0790    0.251       0.366   0.178
APW19981016.0669    0.237       0.423   0.166
APW19981006.0509    0.211       0.359   0.107
APW19981013.0582    0.206       0.395   0.056
APW19981006.0229    0.196       0.510   0.047
Threshold = 0.1
UMass1 – Examples from TDT3
Thai Airbus Crash - New Story APW19981211.0623
Closest story       AllSim      NESim   noNESim
APW19981022.0205    0.250*1.2   0.154   0.341
APW19981110.0229    0.184       0.052   0.282
APW19981113.0905    0.155       0.003   0.228
APW19981002.0557    0.152       0.234   0.012
APW19981114.0396    0.149       0.042   0.245
APW19981006.0511    0.143       0.031   0.251
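To make the contrast between the two examples concrete, the short Python check below counts how many of the six closest stories exceed 0.1 on each similarity. It assumes the threshold of 0.1 from the Russian Financial Crisis example is compared against NESim and noNESim, which the slides do not state explicitly. The variable and function names are invented for this sketch, and it does not attempt to reproduce the 1.6 and 1.2 multipliers.

```python
THRESHOLD = 0.1

# (NESim, noNESim) pairs copied from the two example tables.
russian_old = [(0.273, 0.270), (0.366, 0.178), (0.423, 0.166),
               (0.359, 0.107), (0.395, 0.056), (0.510, 0.047)]
thai_new = [(0.154, 0.341), (0.052, 0.282), (0.003, 0.228),
            (0.234, 0.012), (0.042, 0.245), (0.031, 0.251)]

def count_above(rows, threshold=THRESHOLD):
    """Count neighbours above the threshold for NESim and for noNESim."""
    ne = sum(1 for ne_sim, _ in rows if ne_sim > threshold)
    non_ne = sum(1 for _, non_ne_sim in rows if non_ne_sim > threshold)
    return ne, non_ne

print(count_above(russian_old))  # (6, 4)
print(count_above(thai_new))     # (2, 5)
```

Under this assumption, every neighbour of the old story matches on named entities, while the new story's neighbours match mostly on non-named entities and rarely on named entities.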
UMass1 on TDT3
UMass2
- Basic vector space model system
- Compare with all preceding stories
- Return highest cosine match
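A basic vector space detector of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the UMass2 code: it uses raw term counts as weights (the slides do not say which weighting was used), and the function names and the decision threshold of 0.2 are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def detect_new_events(stories, threshold=0.2):
    """Return (index, best_similarity, is_new) for stories in arrival order."""
    seen = []
    results = []
    for i, tokens in enumerate(stories):
        vec = Counter(tokens)  # raw term frequencies as weights
        # Compare with every preceding story and keep the highest match.
        best = max((cosine(vec, old) for old in seen), default=0.0)
        results.append((i, best, best < threshold))
        seen.append(vec)
    return results

stories = [["crash", "airbus", "thai"],
           ["crash", "airbus", "thai", "rescue"],
           ["ruble", "crisis"]]
results = detect_new_events(stories)
print([is_new for _, _, is_new in results])  # [True, False, True]
```

Because every story is compared with all of its predecessors, the work per story grows with the size of the collection.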
UMass3
- Same model as UMass2
- TDT5: very large collection
- Practical system
  - Compare with a maximum of 25000 stories with highest coordination match
  - Faster
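The candidate restriction can be sketched as follows, taking coordination match in its usual sense: the number of distinct terms two stories share. This Python illustration is an assumption about how such a filter might look, not the UMass3 code. The function name is hypothetical, and dropping stories with no shared terms is a choice made for this sketch.

```python
import heapq

def candidate_stories(new_tokens, earlier, limit=25000):
    """Return indices of up to `limit` earlier stories sharing the most terms."""
    new_terms = set(new_tokens)
    # Coordination match: the number of distinct terms two stories share.
    scored = ((len(new_terms & set(tokens)), idx)
              for idx, tokens in enumerate(earlier))
    top = heapq.nlargest(limit, scored)
    # Stories with no shared terms cannot have a nonzero cosine match.
    return [idx for overlap, idx in top if overlap > 0]

earlier = [["a", "b"], ["a", "b", "c"], ["x"]]
print(candidate_stories(["a", "b", "c"], earlier, limit=1))  # [1]
print(candidate_stories(["a", "b", "c"], earlier))           # [1, 0]
```

Only the returned candidates would then be scored with the full cosine comparison, which caps the cost per story regardless of collection size.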
UMass4
- Similar to UMass1
  - Rationale is the same
- Consider top five matches
- Use different formula for modifying confidence score
Performance Summary
System                                                                Topic weighted min. cost (TDT5)   Topic weighted min. cost (TDT4)
UMass1 – Modify confidence score based on evidence                    0.8790                            0.5055
UMass2 – Basic vector space model                                     0.8387                            0.5404
UMass3 – UMass2 + restriction on number of documents compared with    0.8479                            0.5404
UMass4 – UMass1 with different formula                                0.9213                            --
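For readers unfamiliar with the cost figures, the sketch below shows how a normalized detection cost is commonly computed in TDT-style evaluations; lower values are better. The constants (miss cost 1.0, false alarm cost 0.1, target prior 0.02) do not appear in the slides and are assumed here, as are the function and parameter names. The minimum cost reported in the table would be the lowest such value over all decision thresholds.

```python
def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1,
                              p_target=0.02):
    """Detection cost normalized by the better of two trivial systems.

    The default constants are assumed, not taken from the slides.
    """
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # A trivial system either flags everything or flags nothing.
    baseline = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / baseline

print(round(normalized_detection_cost(0.5, 0.05), 4))  # 0.745
```

With these constants the expression reduces to p_miss + 4.9 * p_fa, so a value of 1.0 corresponds to a system that does no better than never flagging a story as new.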
Summary
- Basic vector space model (UMass2) did the best on TDT5
- Restricting the number of stories compared with (UMass3)
  - Improved system speed
  - Didn’t improve performance
- Primary system (UMass1) did extremely well on training data, but failed on TDT5