TRANSCRIPT
Classification Technology at LexisNexis
SIGIR 2001 Workshop on Operational Text
Classification
Mark Wasson
LexisNexis
mark.wasson@lexisnexis.com
September 13, 2001
Our Boolean Origins
The Topic Identification System
• The Topic Identification System Model
  – Term-based Topic Identification (TTI)
  – Term Mapping System
  – Company Concept Indexing
  – Named Entity Indexing (Companies, People, Organizations, Places)
  – Subject Indexing Prototype (not released)
  – NEXIS Topical Indexing
Psycholinguistics Features
• Propositional Language Model Underlies Surface Forms
• Word Concepts
• Semantic Priming, Additive up to a Point
• Spreading Activation
Terms and Word Concepts
• All words and phrases are searchable – no stop words
• No automatic morphological or thesaurus expansion
  – Exception: name variant generation, but subject to human verification
• Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept
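The word-concept idea above can be sketched as a simple term set. This is only an illustration: the terms and function names below are invented, not LexisNexis vocabulary or code.

```python
# A "word concept" groups functionally equivalent terms for one topic.
# Frequency is counted at the concept level, not per individual term.
# These example terms are hypothetical.
merger_concept = {
    "merger", "mergers", "acquisition", "acquisitions",
    "takeover", "buyout",
}

def concept_frequency(tokens, concept):
    """Count concept-level matches in a stream of tokens or phrases."""
    return sum(1 for t in tokens if t.lower() in concept)

doc = ["Acquisition", "talks", "stalled", "before", "the", "buyout"]
print(concept_frequency(doc, merger_concept))  # 2
```

Two different surface terms ("Acquisition", "buyout") contribute to the same concept count, which is the point of working at the concept level.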
Frequency & Weighting
• Frequency & weighting at the word concept level rather than at the individual term level
• TTI used chi-square to compare individual word concepts to a supervised training set
• TTI used stepwise linear regression to test word concepts in combination and suggest weights
• Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
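The chi-square step above can be sketched as a 2x2 contingency test: does presence of a word concept correlate with relevance in the labeled training set? The counts and function below are invented for illustration; the actual TTI procedure is not published here.

```python
# Chi-square over a 2x2 contingency table comparing one word concept
# against a supervised training set. Counts are invented.

def chi_square_2x2(a, b, c, d):
    """a: relevant docs containing the concept, b: relevant without it,
       c: irrelevant containing it,              d: irrelevant without it."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# Say 80 of 100 relevant docs contain the concept, but only 10 of 100
# irrelevant docs do: a strongly discriminative concept.
score = chi_square_2x2(80, 20, 10, 90)
print(round(score, 1))  # 99.0 -- far above any common significance cutoff
```

A high statistic flags the concept as worth keeping; the stepwise regression mentioned on the slide would then test surviving concepts in combination.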
Problem Word Concepts
5 documents: 3 relevant (G1-G3), 2 irrelevant (B1, B2)
W1 in G1, G2, B1
W2 in G2, G3, B2
W3 in G1, G3, B1
Each W by itself produces 67% recall, 67% precision
W1 + W2 -> 100% recall, 60% precision
W1 + W3 -> 100% recall, 75% precision
W2 + W3 -> 100% recall, 60% precision
W1 + W2 + W3 -> 100% recall, 60% precision
Also, fewer terms -> faster processing
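The slide's arithmetic can be reproduced directly, which makes the point concrete: adding W2 to W1 + W3 gains no recall but costs precision, so the "problem" concept is better dropped.

```python
# Reproduce the slide's recall/precision example.
relevant = {"G1", "G2", "G3"}
hits = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}

def recall_precision(concepts):
    """Retrieve every doc matched by any chosen concept, then score."""
    retrieved = set().union(*(hits[c] for c in concepts))
    rel_hits = retrieved & relevant
    return len(rel_hits) / len(relevant), len(rel_hits) / len(retrieved)

print(recall_precision(["W1"]))              # 67% recall, 67% precision
print(recall_precision(["W1", "W3"]))        # (1.0, 0.75) -- the best combination
print(recall_precision(["W1", "W2", "W3"]))  # (1.0, 0.6)  -- W2 only hurts
```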
Looking Up Terms in Documents
• Count a term extra in key document parts
  – Headlines
  – Leading text
  – Captions
• Count all potential matches
  – "American" gets counted for 100s of companies
• Don't count a term when it is part of another
  – "Mead" in "Mead Corp."
  – "French" in "French fry"
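The "don't count a term when part of another" rule above can be sketched as longest-match-first scanning. This is a simplified illustration, not the LexisNexis matcher: it ignores word boundaries (so "Mead" would still match inside "Meadow") and the extra weighting of headlines and leading text.

```python
def find_terms(text, terms):
    """Greedy left-to-right scan, trying longer terms first, so a term
    embedded in a longer one ("Mead" in "Mead Corp.") is not counted."""
    ordered = sorted(terms, key=len, reverse=True)
    found = []
    i = 0
    while i < len(text):
        for t in ordered:
            if text.startswith(t, i):
                found.append(t)
                i += len(t)  # consume the whole match
                break
        else:
            i += 1
    return found

print(find_terms("Mead Corp. announced plans", ["Mead", "Mead Corp."]))
# ['Mead Corp.'] -- the shorter "Mead" is not counted separately
```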
Calculating Topic Scores
• Sum frequency * weight across all word concepts
• Normalize the score
• Compare to a threshold
  – Verification range in TTI
  – Major references, strong passing references, weak passing references in indexing tools
• Add a controlled vocabulary term or marker to the document if score >= threshold
  – Also add the score and any associated secondary CVTs
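The scoring steps above can be sketched as follows. The weights, the length-based normalization, and the threshold value are all assumptions chosen for illustration; the slide does not specify how LexisNexis normalizes or sets thresholds.

```python
# Topic score = sum over word concepts of frequency * weight,
# normalized, then compared to a threshold. Negative weights let a
# concept count against the topic. All numbers here are invented.

def topic_score(concept_freqs, weights, doc_len):
    raw = sum(freq * weights[c] for c, freq in concept_freqs.items())
    return raw / doc_len  # simple length normalization (an assumption)

weights = {"merger": 2.0, "lawsuit": -1.5}   # positive and negative
freqs = {"merger": 4, "lawsuit": 1}

score = topic_score(freqs, weights, doc_len=100)  # (8.0 - 1.5) / 100
threshold = 0.05
print(score >= threshold)  # True -- so the CVT would be added
```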
Source-dependent, -independent
• Similar field functions, different field names and locations
• Database and file information to guide production processes
The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types
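The source specification idea above can be sketched as a mapping from source-specific field names onto shared field functions, so one topic definition reads "headline" regardless of what each source calls it. The source names, field names, and structure below are entirely hypothetical.

```python
# Hypothetical source specifications: each source labels the same field
# functions differently, so we map function -> source-specific name.
SOURCE_SPECS = {
    "paper": {"headline": "HEADLINE", "leading_text": "LEAD", "body": "BODY"},
    "wire":  {"headline": "HED", "leading_text": "ABSTRACT", "body": "TEXT"},
}

def get_field(doc, source, function):
    """Fetch a field by its function, independent of the source's naming."""
    return doc.get(SOURCE_SPECS[source][function], "")

doc = {"HED": "Merger announced", "TEXT": "Full story text..."}
print(get_field(doc, "wire", "headline"))  # Merger announced
```

With this indirection, a single topic definition (which concepts to weight extra in headlines, for example) can run unchanged across all configured sources.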
Manual vs. Automatic
• Build each definition using an iterative manual process
• Use supervised learning?
  – TTI's chi-square and regression
  – Cost of creating training samples
• Automate repetitive, labor-intensive tasks
  – Generate name variants
• Cheap labor cost – a few minutes to 8 hours
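Name variant generation, the automation example above, can be sketched as rule-based expansion of candidate variants for a human to verify (the slides note that variants are subject to human verification). The suffix list and rules below are illustrative only, not the LexisNexis rules.

```python
# Generate candidate company-name variants for human review.
# Suffix list and expansion rules are invented for illustration.
SUFFIXES = ["Corporation", "Corp.", "Corp", "Inc.", "Inc", "Co."]

def name_variants(name):
    variants = {name}
    for suf in SUFFIXES:
        if name.endswith(" " + suf):
            base = name[: -len(suf) - 1]
            variants.add(base)  # bare name, e.g. "Mead"
            # swap in each alternative corporate suffix
            variants.update(base + " " + s for s in SUFFIXES)
    return sorted(variants)

print(name_variants("Mead Corporation"))
```

Generating the candidates is cheap; the expensive part the slides point to is the human pass that accepts or rejects each one.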
Test, Test, Test
• Business unit benchmarks prior to adoption
• Development process test cases
• Internal benchmarks with 3rd-party technologies
• Sorry, not TREC
• In most tests, topics, and sources, recall and precision are both in the 90-95% range
The End?
• TIS Model? 16 years old
• TTI? In production for 11 years
• Term Mapping? 9 years old
• Entity Indexing? 6-7 years old
• Topical Indexing? 3 years old
• Complemented by SRA NetOwl-based indexing 2 years ago
• No movement afoot to replace any of them
Related Papers
• TTI
  – Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting.
• Company Concept Indexing
  – Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.