
Lexical Chains for Topic Detection and Tracking

British Classification Society

Feb 23rd 2001

Joe Carthy & Nicola Stokes, University College Dublin

joe.carthy@ucd.ie, nicola.stokes@ucd.ie

http://www.cs.ucd.ie/staff/jcarthy

Tel. +353 1 706 2481 or 706 2469, Fax. +353 1 269 7262

Topic Detection and Tracking

• Topic Detection and Tracking (TDT)
  – DARPA-funded TDT project with UMass, CMU and Dragon Systems
  – Domain is all broadcast news: written and spoken

• TDT includes:
  – First Story Detection
  – Event Tracking
  – Segmentation

• Applications
  – digital news editors
  – media analysts
  – equity traders

Topic Tracking and Detection

• Tracking may be defined as:
  – Take a corpus of news stories
  – Given 1 (or 2, 4, 8, 16) sample stories about an event
  – Find all subsequent stories in the corpus about that event
  (a skeletal sketch of this task follows below)

• Detection: is this a new story?

• An event is defined by the list of stories that discuss it, e.g. "Kobe earthquake" is defined by the first story that describes this event
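Read operationally, tracking reduces to a thresholded comparison of each incoming story against the Nt sample stories. The sketch below only illustrates that loop; the similarity function and the threshold value are hypothetical placeholders here (the comparison actually used in this work is the lexical-chain overlap described on later slides).

```python
# Skeletal sketch of the tracking task, assuming a caller-supplied
# `similarity(story_a, story_b)` function; the threshold is illustrative.

def track_event(sample_stories, corpus, similarity, threshold=0.4):
    """Return the subsequent stories judged to be about the same event
    as the Nt sample stories (Nt = 1, 2, 4, 8 or 16 in TDT)."""
    tracked = []
    for story in corpus:
        # A story is flagged if it is close enough to any sample story.
        if any(similarity(sample, story) >= threshold for sample in sample_stories):
            tracked.append(story)
    return tracked
```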


UCD TDT ARCHITECTURE

[Architecture diagram: a Server passes incoming stories to the Lexical Chainer, whose output feeds both the Event Tracker and the Event Detector]

Topic Detection and Tracking

[Diagram: an incoming story from the data stream ("DATE: 02:36, TITLE: O.J. Simpson Bought Knife, Murder Hearing Told") is compared against previous stories, grouped by event: the O.J. Simpson murder trial, the NYC subway bombings, and Carlos the Jackal]

Benchmark Systems

• Implemented benchmark systems using conventional IR techniques (a rough sketch follows below):
  – Stemmed keywords (Porter)
  – Stopword removal
  – Term weighting (Robertson, Sparck Jones)
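For concreteness, here is a minimal sketch of such a keyword baseline. It assumes NLTK's Porter stemmer and English stopword list purely for illustration, and since the slide does not specify the exact weighting formula, a simple idf-style weight stands in for the Robertson/Sparck Jones scheme.

```python
# Keyword-baseline sketch: Porter stemming, stopword removal and a
# simple idf-style term weight (illustrative, not the system's exact scheme).
import math
from collections import Counter

from nltk.corpus import stopwords   # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def index_story(tokens):
    """Return {stem: term frequency} with stopwords removed."""
    stems = [STEMMER.stem(t.lower()) for t in tokens if t.lower() not in STOPWORDS]
    return Counter(stems)

def idf_weights(indexed_stories):
    """Simple idf weight log(N / n_t) over a collection of indexed stories."""
    n = len(indexed_stories)
    doc_freq = Counter(term for story in indexed_stories for term in story)
    return {term: math.log(n / df) for term, df in doc_freq.items()}
```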

Lexical Chaining

– Lexical chains are based on textual cohesion (Halliday & Hasan)

– Cohesion: the text makes sense as a whole

– Cohesion occurs where the interpretation of one item depends on that of another item in the text. It is this dependency that gives rise to cohesion.


Lexical Chaining

– Where the cohesive elements occur over a number of sentences, a cohesive chain is formed.

– For example, the sentences:

John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.

– give rise to the lexical chain: {mud pie, dessert, mud pie, chocolate, it}

– Lexical cohesion is, as the name suggests, lexical: it involves the selection of a lexical item that is in some way related to one occurring previously.


Lexical Chaining

– Reiteration is a form of lexical cohesion which involves the repetition of a lexical item.

This may involve simple repetition of the word, but also includes the use of a synonym, near-synonym or superordinate.

For example, in the sentences "John bought a Jag. He loves the car.", the superordinate car refers back to the subordinate Jag.

The part-whole relationship is also an example of lexical cohesion, e.g. airplane and wing.

– A lexical chain is a sequence of related words in the text, spanning short or long distances.


Lexical Chaining

– A chain is independent of the grammatical structure of the text and in effect it is a list of words that captures a portion of the cohesive structure of the text.

– A lexical chain can provide a context for the resolution of an ambiguous term and enable identification of the concept the term represents, i.e. word sense disambiguation

– Morris and Hirst were the first researchers to suggest the use of lexical chains to determine the structure of texts.


Lexical Chaining

– By identifying the lexical chains in a news story we hope to identify its focus. This can then be used in tracking and detection.

– It is important to realise that determining lexical chains is not a sophisticated natural language analysis process.

– Other applications of lexical chaining:
  • Hypertext links: Green
  • Summarisation: Barzilay
  • Segmentation: Okumura and Honda
  • IR: Stairmand, Ellman, Mochizuki
  • Malapropism detection: St-Onge
  • Multimedia indexing: Kazman, Al-Halimi

Chain Generation

– In order to construct lexical chains we must be able to identify relationships between terms.

– This is made possible by the use of WordNet

– WordNet is a computational lexicon which was developed at Princeton University.

– In WordNet, synonym sets (synsets) are used to represent concepts where a synonym set corresponds to a concept and consists of all those terms that may be used to refer to that concept.


Chain Generation

– For example, the concept airplane is represented by the synset {airplane, aeroplane, plane}.

– A WordNet synset has a numerical identifier such as 02054514.

– Links between synsets in WordNet represent conceptual relations such as synonymy, hyponymy, meronymy (part-of) etc.

– The synset identifier can be used to represent the concept referred to in the synset, for indexing and lexical chaining purposes (illustrated in the sketch below).
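As a concrete illustration of synsets, offsets and relations, the snippet below uses NLTK's WordNet interface. This is an assumption made only for the example: the original system accessed WordNet directly, and the offsets quoted on these slides come from an older WordNet release, so the numbers printed here will differ.

```python
# Inspect the synsets for "airplane": each synset groups the lemmas that
# name one concept, carries a numerical offset usable as a concept
# identifier, and is linked to other synsets by relations such as
# hypernymy and part-of (meronymy).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("airplane", pos=wn.NOUN):
    print(synset.offset(), synset.lemma_names())
    print("  hypernyms:    ", [s.name() for s in synset.hypernyms()])
    print("  part meronyms:", [s.name() for s in synset.part_meronyms()])
```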

Word Sense Disambiguation

[Diagram: the term "car" (term_i) is disambiguated using the earlier term "exhaust" (1st term). "Exhaust" has the senses car exhaust (synset 32748) and tire out / fatigue (374222); "car" has the senses automobile (057643) and railway carriage (324932, part of train 3984). The "has a" link between automobile and car exhaust selects the automobile sense of "car".]


Chain Generation

• Chaining procedure for a story:

  – Take the ith term in the story and generate the set Neighbour_i of its related synsets

  – For each other term, if it is a member of the set Neighbour_i, then add it to the lexical chain for term_i

– If the lexical chain contains 3 or more elements then store the chain in a chain index file

– Repeat the above for all terms in the story (a rough sketch of this procedure follows below).
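The sketch below follows the four steps above, again assuming NLTK's WordNet interface. What counts as a "related synset", and how word senses are disambiguated while chaining, is richer in the actual chainer than in this simplified version.

```python
# Simplified chain generation: Neighbour_i is approximated by the term's
# own synsets plus directly linked synsets; a term joins the chain if one
# of its synsets falls in that neighbourhood.
from nltk.corpus import wordnet as wn

def related_synsets(term):
    """Neighbour_i: the term's synsets plus directly linked synsets."""
    neighbours = set()
    for synset in wn.synsets(term):
        neighbours.add(synset)
        neighbours.update(synset.hypernyms())
        neighbours.update(synset.hyponyms())
        neighbours.update(synset.part_meronyms())
        neighbours.update(synset.part_holonyms())
    return neighbours

def build_chains(terms, min_length=3):
    """Return the lexical chains (as term lists) with 3 or more members."""
    chains = []
    for i, term_i in enumerate(terms):
        neighbour_i = related_synsets(term_i)
        chain = [term_i]
        for j, term_j in enumerate(terms):
            if j != i and any(s in neighbour_i for s in wn.synsets(term_j)):
                chain.append(term_j)
        if len(chain) >= min_length:
            chains.append(chain)
    return chains
```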

– Computing Chain_Sim(Trackset_i, Story_j)

• Using the overlap coefficient (see the sketch below), which may be defined as follows for two lexical chains c1 and c2:

Overlap Coefficient = |c1 ∩ c2| / min(|c1|, |c2|)
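The coefficient is a one-liner once chains are represented as sets; in the sketch below the chain elements are arbitrary identifiers chosen only for illustration.

```python
# Overlap coefficient between two lexical chains, each represented here
# as a set of chain elements (e.g. synset identifiers).
def overlap_coefficient(c1, c2):
    """|c1 ∩ c2| / min(|c1|, |c2|) for two non-empty chains."""
    return len(c1 & c2) / min(len(c1), len(c2))

# e.g. overlap_coefficient({"a", "b", "c"}, {"b", "c", "d"}) == 2 / 3
```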


Evaluation Metrics

– The system returns a set S of documents:
  • a = # in S discussing new events
  • b = # in S not discussing new events
  • c = # in S' discussing new events
  • d = # in S' not discussing new events

– Recall = a / (a + c)
– Precision = a / (a + b)
– Miss Rate = c / (a + c) = 1 - Recall
– False Alarm Rate = b / (b + d) = Fallout

(these are computed in the sketch below)
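A direct transcription of these four definitions, with the counts a, b, c, d passed in as integers:

```python
# TDT evaluation metrics from the four contingency counts:
# a, b are inside the returned set S; c, d are outside it (S').
def tdt_metrics(a, b, c, d):
    recall = a / (a + c)
    precision = a / (a + b)
    miss_rate = c / (a + c)          # = 1 - recall
    false_alarm_rate = b / (b + d)   # = fallout
    return recall, precision, miss_rate, false_alarm_rate
```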


Tracking Results

[Chart: Average Recall (%) vs. Threshold (0.25 to 0.6) for LexTrack-O, KeyTrack and LexTrack, Nt = 1]

Tracking Results

[Chart: % Miss Rate vs. Threshold (0.25 to 0.6) for LexTrack-O, KeyTrack and LexTrack, Nt = 1]

Detection Results

[Chart: Detection performance, % Misses vs. % False Alarms, for Lex_Detect, TRAD and CHAINS_ONLY]

Analysis of results

– Expected trade-off between precision and recall

– A small number of stories is sufficient to construct a tracking query

– Performance in line with other TDT researchers

– Lexical chains: improvement not significant?


TDT and Lexical Chain References

• Allan, J., Carbonell, J., Doddington, G., Yamron, J, and Yang, Y., “Topic Detection and Tracking Pilot Study: Final Report”, Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Morgan Kaufmann, San Francisco,1998.

• Allan, J., Papka, R., and Lavrenko, V., “Online New Event Detection and Tracking”, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998.

• Barzilay, R., “Lexical Chains for Summarization”, M.Sc. Thesis, Ben-Gurion University of the Negev, Israel, November 1997.

• Barzilay, R., and Elhadad, M., “Using Lexical Chains for Text Summarization”, The Fifth Bar-Ilan Symposium on Foundations of Artificial Intelligence Focusing on Intelligent Agents, Bar-Ilan University, Ramat Gan, Israel, June 1997.

• Budanitsky, A., “Lexical Semantic Relatedness and its Application in Natural Language Processing”, (PhD thesis) Technical Report CSRG-390, University of Toronto, 1999.

• Ellman, J., “Using Roget's Thesaurus to Determine the Similarity of Texts”, PhD Thesis, University of Sunderland, 2000.

• Fellbaum, C., (Ed.), WordNet: An Electronic Lexical Database and Some of its Applications, MIT Press, 1998.

• Green, S.J., “Automatically Generating Hypertext by Computing Semantic Similarity”, Ph.D. Thesis, University of Toronto, 1997.


• Halliday, M.A.K. and Hasan, R., “Cohesion In English”, Longman , 1976.

• Hatch, P., "Lexical Chaining for the Online Detection of New Events", M.Sc. Thesis, University College Dublin, 2000.

• Hirst, G., and St-Onge, D., “Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms”, in WordNet: An Electronic Lexical Database and Some of its Applications, Fellbaum, C., (Ed.), MIT Press, 1998.

• Kazman, R., Al-Halimi, R., Hunt, W., and Mantei, M., “Four Paradigms for Indexing Video Conferences”, IEEE MultiMedia, 3 (1), Spring 1996.

• Mochizuki, H., Iwayama, M., and Okumura, M., “Passage Level Document Retrieval Using Lexical Chains”, RIAO 2000, Content Based Multimedia Information Access, 491-506, 2000.

• Morris J., and Hirst, G., “Lexical Cohesion, the Thesaurus, and the Structure of Text”, Computational Linguistics, 17 (1), 211-232, 1991.

• Okumura, M., and Honda, T., “Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion”, In Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94), Vol. 2, 755-761, Kyoto, Japan, August 1994.

• Porter, M.F., “An Algorithm for Suffix Stripping”, Program, 14, 130-137, 1980.

• Robertson, S.E. and Sparck Jones, K., "Simple Approaches to Text Retrieval", University of Cambridge Computing Laboratory Technical Report Number 356, May 1997.

• Stairmand, M.A., “A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval”, Ph.D. Thesis, UMIST, 1996.

• Stokes, N., and Carthy, J., “First Story Detection Using a Composite Document Representation”, HLT 2001, Human Language Technology Conference, San Diego, California, March 18-21, 2001.

• TDT2000, “The Year 2000 Topic Detection and Tracking (TDT2000) Task Definition and Evaluation Plan”, available at the following URL: http://morph.ldc.upenn.edu/TDT/Guide/manual.front.html, November 2000.
