Classification Technology at LexisNexis SIGIR 2001 Workshop on Operational Text Classification Mark Wasson LexisNexis mark.wasson@lexisnexis.com September 13, 2001


Classification Technology at LexisNexis

SIGIR 2001 Workshop on Operational Text Classification

Mark Wasson

LexisNexis

mark.wasson@lexisnexis.com

September 13, 2001

Our Boolean Origins

The Topic Identification System

• The Topic Identification System Model
  – Term-based Topic Identification (TTI)
  – Term Mapping System
  – Company Concept Indexing
  – Named Entity Indexing (Companies, People, Organizations, Places)
  – Subject Indexing Prototype (not released)
  – NEXIS Topical Indexing

Psycholinguistics Features

• Propositional Language Model Underlies Surface Forms
• Word Concepts
• Semantic Priming, Additive up to a Point
• Spreading Activation

Terms and Word Concepts

• All words and phrases are searchable – no stop words
• No automatic morphological or thesaurus expansion
  – Exception: name variant generation, but subject to human verification
• Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept

Frequency & Weighting

• Frequency & weighting at word concept level rather than at individual term level
• TTI used chi-square to compare individual word concepts to supervised training set
• TTI used stepwise linear regression to test word concepts in combination and suggest weights
• Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
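The slides do not give TTI's exact computation, so the following is only a minimal sketch of a standard Pearson chi-square statistic on a 2x2 table (word concept present or absent, training document relevant or irrelevant). The function name and the counts are illustrative assumptions, not values from the talk.

```python
def chi_square_2x2(rel_with, irr_with, rel_without, irr_without):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Rows: word concept present / absent.
    Columns: relevant / irrelevant training documents.
    """
    a, b, c, d = rel_with, irr_with, rel_without, irr_without
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so no association can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical training counts for one word concept.
score = chi_square_2x2(rel_with=30, irr_with=10, rel_without=20, irr_without=40)
print(f"chi-square = {score:.2f}")
```

A larger value suggests a stronger association between the word concept and the topic; how TTI turned such values into candidate weights is not described in the slides.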

Problem Word Concepts

5 documents: 3 relevant (G), 2 irrelevant (B)

W1 in G1, G2, B1
W2 in G2, G3, B2
W3 in G1, G3, B1

Each W by itself produces 67% recall, 67% precision

W1 + W2 -> 100% recall, 60% precision
W1 + W3 -> 100% recall, 75% precision
W2 + W3 -> 100% recall, 60% precision
W1 + W2 + W3 -> 100% recall, 60% precision

Also, fewer terms -> faster processing
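The figures in this example can be checked with a short script. This is a sketch that assumes "W1 + W2" means a document is retrieved when it contains any of the listed word concepts; the function and variable names are illustrative.

```python
from itertools import combinations

relevant = {"G1", "G2", "G3"}
concepts = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}


def recall_precision(names):
    """Recall and precision when a document matching any named concept is retrieved."""
    retrieved = set().union(*(concepts[name] for name in names))
    hits = retrieved & relevant
    return len(hits) / len(relevant), len(hits) / len(retrieved)


def evaluate_all():
    """Evaluate every non-empty combination of word concepts."""
    results = {}
    for size in range(1, len(concepts) + 1):
        for names in combinations(sorted(concepts), size):
            results[names] = recall_precision(names)
    return results


for names, (recall, precision) in evaluate_all().items():
    print(" + ".join(names), f"recall={recall:.0%} precision={precision:.0%}")
```

Only the W1 + W3 pair avoids pulling in B2, which is why it keeps 75% precision while every combination containing W2 drops to 60%.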

Looking Up Terms in Documents

• Count a term extra in key document parts
  – Headlines
  – Leading text
  – Captions
• Count all potential matches
  – American gets counted for 100s of companies
• Don’t count a term when part of another
  – Mead in Mead Corp.
  – French in French Fry
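Two of these lookup rules can be sketched as follows: a longest-match scan, so that "Mead" is not counted inside "Mead Corp.", and extra credit for matches in key document parts. The whitespace tokenization, the part names, and the multiplier values are all assumptions made for illustration; the slides do not specify them.

```python
# Hypothetical multipliers: how much a match counts in each document part.
PART_MULTIPLIER = {"headline": 2, "lead": 2, "caption": 2, "body": 1}


def count_terms(tokens, terms):
    """Count term matches, letting the longest term at each position win.

    `terms` is a set of token tuples, e.g. {("Mead",), ("Mead", "Corp.")}.
    """
    max_len = max((len(term) for term in terms), default=0)
    counts = {}
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so shorter terms inside it are skipped.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in terms:
                match = candidate
                break
        if match is None:
            i += 1
        else:
            counts[match] = counts.get(match, 0) + 1
            i += len(match)
    return counts


def count_in_document(parts, terms):
    """Sum counts over document parts, weighting key parts more heavily."""
    totals = {}
    for part, text in parts.items():
        multiplier = PART_MULTIPLIER.get(part, 1)
        for term, n in count_terms(text.split(), terms).items():
            totals[term] = totals.get(term, 0) + n * multiplier
    return totals


terms = {("Mead",), ("Mead", "Corp.")}
doc = {"headline": "Mead Corp. earnings", "body": "Mead Corp. said Mead sales rose"}
print(count_in_document(doc, terms))
```

The sketch does not model the "count all potential matches" rule, under which an ambiguous term such as American is credited to every company it could refer to.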

Calculating Topic Scores

• Summation of frequency * weight across all word concepts
• Normalize score
• Compare to threshold
  – Verification range in TTI
  – Major references, strong passing references, weak passing references in indexing tools
• Add controlled vocabulary term or marker to document if score >= threshold
  – Add score, any associated secondary CVTs
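The scoring steps can be sketched in a few lines. The slides do not say how scores are normalized or what the thresholds are, so the per-100-words normalization, the cut-off values, and all names below are assumptions for illustration only.

```python
def topic_score(concept_counts, concept_weights, doc_length):
    """Sum frequency * weight over word concepts, then normalize by length."""
    raw = sum(
        count * concept_weights.get(concept, 0.0)
        for concept, count in concept_counts.items()
    )
    # Assumed normalization: score per 100 words of document text.
    return 100.0 * raw / doc_length if doc_length > 0 else 0.0


# Hypothetical cut-offs, highest first.
BANDS = [
    (5.0, "major reference"),
    (2.0, "strong passing reference"),
    (0.5, "weak passing reference"),
]


def label(score, bands=BANDS):
    """Return the first band whose threshold the score meets, else None."""
    for threshold, name in bands:
        if score >= threshold:
            return name
    return None


# Positive and negative weights, as the slides allow.
counts = {"merger": 4, "sports": 1}
weights = {"merger": 2.0, "sports": -1.5}
score = topic_score(counts, weights, doc_length=200)
print(score, label(score))
```

A document whose label is not None would receive the controlled vocabulary term together with its score.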

Source-dependent, -independent

• Similar field functions, different field names and locations
• Database and file information to guide production processes

The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types.
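One way to picture the source specification idea is a per-source mapping from logical field roles to that source's own field names, so that a topic definition only ever sees the logical roles. The format of the real specification file is not shown in the slides; the source names, field names, and function below are entirely hypothetical.

```python
# Hypothetical source specifications: logical role -> source-specific field name.
SOURCE_SPECS = {
    "newswire_a": {"headline": "HEADLINE", "lead": "LEAD", "body": "TEXT"},
    "journal_b": {"headline": "TITLE", "lead": "ABSTRACT", "body": "BODY"},
}


def to_logical_fields(record, source):
    """Re-key a source record by logical role; missing fields become empty text."""
    spec = SOURCE_SPECS[source]
    return {role: record.get(field, "") for role, field in spec.items()}


record = {"TITLE": "Mead Corp. earnings", "BODY": "Mead Corp. said sales rose"}
print(to_logical_fields(record, "journal_b"))
```

With records normalized this way, the same lookup and scoring code can run unchanged over sources whose field names and layouts differ.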

Manual vs. Automatic

• Build each definition using iterative manual process
• Use supervised learning?
  – TTI’s chi-square and regression
  – Cost of creating training samples
• Automate repetitive, labor-intensive tasks
  – Generate name variants
• Cheap labor cost – few minutes to 8 hours

Test, Test, Test

• Business unit benchmarks prior to adoption
• Development process test cases
• Internal benchmarks with 3rd party technologies
• Sorry, not TREC
• Most tests, topics, sources – recall and precision both in the 90-95% range

The End?

• TIS Model? 16 years old
• TTI? In production for 11 years
• Term Mapping? 9 years old
• Entity Indexing? 6-7 years old
• Topical Indexing? 3 years old
• Complemented by SRA NetOwl-based indexing 2 years ago
• No movement afoot to replace any of them

Related Papers

• TTI
  – Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting.
• Company Concept Indexing
  – Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.