
Classification Technology at LexisNexis

SIGIR 2001 Workshop on Operational Text Classification

Mark Wasson

LexisNexis


September 13, 2001


Our Boolean Origins

• The Topic Identification System Model
– Term-based Topic Identification (TTI)
– Term Mapping System
– Company Concept Indexing
– Named Entity Indexing (Companies, People, Organizations, Places)
– Subject Indexing Prototype (not released)
– NEXIS Topical Indexing

The Topic Identification System

• Propositional Language Model Underlies Surface Forms

• Word Concepts
• Semantic Priming, Additive up to a Point
• Spreading Activation

Psycholinguistics Features

• All words and phrases are searchable – no stop words

• No automatic morphological or thesaurus expansion
– Exception: name variant generation, but subject to human verification

• Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept

Terms and Word Concepts

• Frequency & weighting at word concept level rather than at individual term level

• TTI used chi-square to compare individual word concepts to supervised training set

• TTI used stepwise linear regression to test in combination and suggest weights

• Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
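To make the chi-square step concrete, here is a minimal sketch of the textbook Pearson statistic for a 2x2 table that crosses "document contains the word concept" with "document is relevant in the training set". The slides do not say how TTI built its table, so the function name, argument names, and the counts in the usage lines are all assumptions made for illustration.

```python
def chi_square_2x2(rel_with, rel_without, irr_with, irr_without):
    """Pearson chi-square statistic for a 2x2 contingency table.

    rel_with    : relevant documents that contain the word concept
    rel_without : relevant documents that do not
    irr_with    : irrelevant documents that contain it
    irr_without : irrelevant documents that do not
    (All four names are hypothetical, not taken from TTI.)
    """
    n = rel_with + rel_without + irr_with + irr_without
    row_rel = rel_with + rel_without
    row_irr = irr_with + irr_without
    col_with = rel_with + irr_with
    col_without = rel_without + irr_without
    denom = row_rel * row_irr * col_with * col_without
    if denom == 0:
        # A degenerate table carries no evidence either way.
        return 0.0
    cross = rel_with * irr_without - rel_without * irr_with
    return n * cross ** 2 / denom


# Made-up training counts: the concept appears in 40 of 50 relevant
# documents but in only 5 of 50 irrelevant ones.
print(chi_square_2x2(40, 10, 5, 45))
# A concept that is equally common in both groups scores zero.
print(chi_square_2x2(10, 10, 10, 10))
```

A high statistic only says the concept is associated with the topic label in one direction or the other; the sign of the association, and any weight, would still have to come from a separate step such as the regression mentioned above.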

Frequency & Weighting

5 documents: 3 relevant (G), 2 irrelevant (B)

W1 in G1, G2, B1

W2 in G2, G3, B2

W3 in G1, G3, B1

Each W by itself produces 67% recall, 67% precision

W1 + W2 -> 100% recall, 60% precision



W1 + W2 + W3 -> 100% recall, 60% precision

Also, fewer terms -> faster processing
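The five-document example above can be checked mechanically. The sketch below assumes that "+" means a document is retrieved if it contains any of the listed word concepts (a Boolean OR), which is the reading that reproduces the figures on the slide; the variable and function names are illustrative.

```python
RELEVANT = {"G1", "G2", "G3"}

# Which documents each word concept occurs in, as listed on the slide.
POSTINGS = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}


def recall_precision(concepts):
    """Recall and precision when any of the given concepts retrieves a document."""
    retrieved = set().union(*(POSTINGS[c] for c in concepts))
    hits = retrieved & RELEVANT
    recall = len(hits) / len(RELEVANT)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


for combo in (["W1"], ["W2"], ["W3"], ["W1", "W2"], ["W1", "W2", "W3"]):
    recall, precision = recall_precision(combo)
    print(" + ".join(combo), f"recall={recall:.0%} precision={precision:.0%}")
```

Adding W3 to W1 + W2 retrieves no new documents, so it changes neither figure and only adds lookup work.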

Problem Word Concepts

• Count a term extra in key document parts
– Headlines
– Leading text
– Captions
• Count all potential matches
– American gets counted for 100s of companies
• Don’t count a term when part of another
– Mead in Mead Corp.
– French in French Fry
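The three lookup rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the production matcher: the lexicon entries, the concept identifiers, and the zone multipliers are all hypothetical, and longest-match-first regular expression alternation stands in for whatever mechanism the real system used.

```python
import re

# Hypothetical lexicon: surface term -> every word concept it may belong to.
LEXICON = {
    "mead corp.": ["company:mead-corp"],
    "mead": ["term:mead"],
    "french fry": ["food:french-fry"],
    "french": ["nationality:french"],
    "american": ["company:american-airlines", "company:american-express"],
}

# Assumed multipliers: a hit in a key document part counts extra.
ZONE_WEIGHT = {"headline": 2, "lead": 2, "caption": 2, "body": 1}

# Longer terms come first so "mead corp." wins over "mead" at the same position.
_TERMS = sorted(LEXICON, key=len, reverse=True)
_PATTERN = re.compile(
    "|".join(r"(?<!\w)" + re.escape(t) + r"(?!\w)" for t in _TERMS)
)


def count_concepts(zones):
    """Return {word concept: weighted frequency} for a {zone name: text} document."""
    counts = {}
    for zone, text in zones.items():
        bonus = ZONE_WEIGHT.get(zone, 1)
        for match in _PATTERN.finditer(text.lower()):
            # An ambiguous term is counted for all of its candidate concepts.
            for concept in LEXICON[match.group(0)]:
                counts[concept] = counts.get(concept, 0) + bonus
    return counts


doc = {
    "headline": "Mead Corp. expands",
    "body": "The French Fry maker praised Mead Corp. and American partners.",
}
print(count_concepts(doc))
```

Because matches do not overlap, "Mead" inside "Mead Corp." and "French" inside "French Fry" are never counted on their own, while "American" is counted once for each candidate company.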

Looking Up Terms in Documents

• Summation of frequency * weight across all word concepts

• Normalize score
• Compare to threshold
– Verification range in TTI
– Major references, strong passing references, weak passing references in indexing tools
• Add controlled vocabulary term or marker to document if score >= threshold
– Add score, any associated secondary CVTs
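The scoring steps above can be sketched as follows. The slides do not say how the score is normalized or what the threshold values are, so the per-100-words normalization, the three cut-offs, and all names and numbers below are assumptions chosen only to make the example run.

```python
def topic_score(concept_counts, weights, doc_length):
    """Sum frequency * weight over word concepts, then normalize.

    Normalizing per 100 words of document length is an assumption;
    the slides only say "normalize score".
    """
    raw = sum(freq * weights.get(concept, 0.0)
              for concept, freq in concept_counts.items())
    return raw * 100 / doc_length if doc_length else 0.0


def reference_level(score, major=1.5, strong=1.0, weak=0.5):
    """Map a score to a reference level; the cut-off values are hypothetical."""
    if score >= major:
        return "major"
    if score >= strong:
        return "strong passing"
    if score >= weak:
        return "weak passing"
    return None  # below every threshold: no controlled vocabulary term added


# Made-up word concepts; C has a negative weight, so it counts against the topic.
counts = {"A": 3, "B": 1, "C": 2}
weights = {"A": 2.0, "B": 1.0, "C": -1.5}
score = topic_score(counts, weights, doc_length=200)
print(score, reference_level(score))
```

Negative weights let a word concept pull the score down without vetoing the topic outright, which is the middle ground between weighting and strict yes/no Boolean logic.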

Calculating Topic Scores

• Similar field functions, different field names and locations

• Database and file information to guide production processes

The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types
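One way to picture the split between source-independent topic definitions and source-dependent specifications is a lookup from logical field roles to each source's own field names. The format of the real source specification file is not given in the slides; the source identifiers, field names, and function below are entirely hypothetical.

```python
# Hypothetical source specifications: logical role -> that source's field name.
SOURCE_SPECS = {
    "newswire_a": {"headline": "HEADLINE", "lead": "LEAD-PARA", "body": "TEXT"},
    "journal_b": {"headline": "TITLE", "lead": "ABSTRACT", "body": "BODY"},
}


def to_logical_zones(source_id, record):
    """Re-key a source-specific record by logical role; missing fields become empty."""
    spec = SOURCE_SPECS[source_id]
    return {role: record.get(field, "") for role, field in spec.items()}


wire_doc = {"HEADLINE": "Paper maker expands", "TEXT": "Full story text."}
journal_doc = {"TITLE": "Indexing at scale", "ABSTRACT": "We describe a system."}

# The same topic definition can now address "headline" for either source.
print(to_logical_zones("newswire_a", wire_doc))
print(to_logical_zones("journal_b", journal_doc))
```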

Source-dependent, -independent

• Build each definition using iterative manual process

• Use supervised learning?
– TTI’s chi-square and regression
– Cost of creating training samples
• Automate repetitive, labor-intensive tasks
– Generate name variants

• Cheap labor cost – few minutes to 8 hours

Manual vs. Automatic

• Business unit benchmarks prior to adoption

• Development process test cases
• Internal benchmarks with 3rd party technologies
• Sorry, not TREC

• Most tests, topics, sources – recall and precision both in the 90-95% range

Test, Test, Test

• TIS Model? 16 years old
• TTI? In production for 11 years
• Term Mapping? 9 years old
• Entity Indexing? 6-7 years old
• Topical Indexing? 3 years old

• Complemented by SRA NetOwl-based indexing 2 years ago

• No movement afoot to replace any of them

The End?

• TTI
– Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting.
• Company Concept Indexing
– Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.

Related Papers