Page 1: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Portable Classification Tools

Mark Shewhart

LexisNexis

21 June 2001

Page 2: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Overview

• Classification Tools and Types

• Consistent Controlled Classification Schemes Across All Content

• Benefits of C.C.C.S.

• Approaches to “Portable” Classification

• Challenges

• Examples

• Q & A

Page 3: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Introduction

Mark Shewhart

LexisNexis

One of the early innovators in building on-line databases and search tools, with classification

Currently providing an increasing range of tools, solutions, and services to support the information needs of government organizations, companies, and individuals

Page 4: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Uncontrolled Classification

PROS

No manual development of classification algorithms or searches

Aids in knowledge discovery & taxonomy development

Adapts to changing terminology and topics

CONS

Difficulty providing meaningful labels to taxonomy

Problematic on fine-grained rules

Examples

Verity, Semio, SRA’s NetOwl Extractor, InXight’s Thing Finder, LEXIS-NEXIS core-terms

Page 5: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Controlled Classification

Machine Learning

Provide several hundred “on-point” samples per topic

Most systems do not allow for manual intervention

Examples - Verity, Semio, Autonomy, InXight, Purple Yogi, Webmind, Fulcrum, SmartLogik.

Manually Created “Algorithms”

Human Indexers manually create the algorithm for each topic

Examples - Any Boolean Search Engine, Verity, InXight Classifier, LEXIS-NEXIS SmartIndexing, Factiva Intelligent Indexing, Metacode, Sageware.

Page 6: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Controlled Classification

Basic search tools with complex queries created by domain experts are a form of controlled classification

Natural Language

Verity, Alta-Vista, LexisNexis, West ...

Boolean

MS Site Server, Alta-Vista, LexisNexis, West, Factiva, Dialog ...

Enhanced - additional “beyond boolean” operators/control

Verity, Semio ...

Page 7: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Uses for Uncontrolled Classification

Taxonomy Development

Several companies market tools focused on taxonomy development

Knowledge Discovery

Relationships between terms

New or changing terms

Page 8: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification Scheme Everywhere

Your Intranet, The Web, and Premium Content Providers

Search all three using the same taxonomy

A consistent, controlled classification scheme facilitates data analysis & visualization - BIZ360, I2

Intra-document linking by taxonomy nodes

Investigative Analysis of content

Page 9: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification - One Stop Search

Premium Content

Your Intranet

Web Content

One Stop Search

Mining

Page 10: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification - Locate & Link

Dossier

Case Law

Patents

Computer Company News

Computing & Tech News

Microsoft News

Case with Microsoft as a Party

Explore LEXIS-NEXIS for Microsoft

Microsoft Web Site

Page 11: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Company Tracking and Analysis

MICROSOFT CORP

INTEL

DELL COMPUTER CORP

Your Companies

User pre-selects companies to track.

Page 12: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Company Tracking and Analysis

Your Companies: MICROSOFT CORP, INTEL, DELL COMPUTER CORP

[Chart: Microsoft Corp News Coverage. Articles per day, 3/1/01 to 3/8/01; y-axis 0 to 250]

[Chart: MSFT Stock Closing. Closing price per day, 3/1/01 to 3/8/01; y-axis 0 to 120]

User selects Microsoft Corp.

Higher than average coverage flagged

Page 13: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Company Tracking and Analysis

Your Companies: MICROSOFT CORP, INTEL, DELL COMPUTER CORP

[Chart: Microsoft Corp News Coverage. Articles per day, 3/1/01 to 3/9/01; y-axis 0 to 400]

[Chart: MSFT Stock Closing. Closing price per day, 3/1/01 to 3/9/01; y-axis 0 to 120]

The next day - User is back again

Extremely high coverage flagged

Page 14: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Company Tracking and Analysis

Your Companies: MICROSOFT CORP, INTEL, DELL COMPUTER CORP

[Chart: Microsoft Corp News Coverage. Articles per day, 3/1/01 to 3/9/01; y-axis 0 to 400]

[Chart: MSFT Stock Closing. Closing price per day, 3/1/01 to 3/9/01; y-axis 0 to 120]

Click on the red circle for News Topic Analysis

[Chart: Topic Analysis. Articles per topic, axis 0 to 300; bars: EXECUTIVE CHANGES, STOCKS, LAWSUITS, BILL GATES, US DEPARTMENT OF JUSTICE]

Page 15: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Company Tracking and Analysis

Your Companies: MICROSOFT CORP, INTEL, DELL COMPUTER CORP

[Chart: Microsoft Corp News Coverage. Articles per day, 3/1/01 to 3/9/01; y-axis 0 to 400]

[Chart: MSFT Stock Closing. Closing price per day, 3/1/01 to 3/9/01; y-axis 0 to 120]

User clicks on the “STOCKS” bar for the news

[Chart: Topic Analysis. Articles per topic, axis 0 to 300; bars: EXECUTIVE CHANGES, STOCKS, LAWSUITS, BILL GATES, US DEPARTMENT OF JUSTICE]

Page 16: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Answer Set Navigation

Executive Changes

Stocks

Lawsuits

[Chart: Topic Analysis. Articles per topic, axis 0 to 300; bars: EXECUTIVE CHANGES, STOCKS, LAWSUITS, BILL GATES, US DEPARTMENT OF JUSTICE]

User clicks on Topic Analysis

More Executive Changes

More Stocks

More Lawsuits

Page 17: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification - Trending

Trend Analysis of Metadata

NEXercise

User Selected Indexing Terms: Online Trading, Electronic Commerce, Internet Crime

Download into Excel Spreadsheet

[Chart: trend lines for the three indexing terms, 3Q 99 to 2Q 00; y-axis 0 to 16000]

                      3Q 99   4Q 99   1Q 00   2Q 00
Online Trading          192    1354    3303   15121
Electronic Commerce    5160    8788   13300    8558
Internet Crime          680     918    1565    1426

Page 18: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification - Press Trending

Trending in the News

• International Herald Tribune (Neuilly-sur-Seine, France), July 4, 2000, Tuesday … The National Security Agency certainly features regularly in Mr. Gertz's coverage. A Lexis-Nexis search lists 132 Gertz stories in The Washington Times going back to 1989 that have mentioned the agency.

• The Washington Post, June 28, 2000, ... easily discern one of the issues of greatest concern to voters: George W. Bush's position on the death penalty. A Nexis search Monday for stories mentioning Bush at least three times and the words "death penalty" or "executions" or "capital punishment" at least three …

• The New York Times, June 14, 2000, ... tally the Hotline political tip sheet keeps of how often possible vice-presidential choices merit a major media mention. Mr. Danforth had 10 mentions, compared with 49 for Gov. Tom Ridge of Pennsylvania, No. 1 on the 53-name list.

• The Washington Times, May 05, 2000, … "A Nexis search of 'extreme right' over the past month scored 212 mentions; a Nexis search of 'extreme left' over the past month yielded 58 items.

• MC Technology Marketing Intelligence, December 1, 1999 … We looked at such quantitative data as stock performance in 1999 and the number of press mentions (as shown in a Lexis-Nexis search),

• Fortune, October 12, 1998, … Just how addicted to cliches are financial media editors? Here's a list of fave words and the number of stock market stories in which they appeared, generated by a Lexis-Nexis search from the end of August to Sept. 11: Turmoil: 1,559; plunge: 1,260; crash: 965; correction: 860; bear market: 750; ...

Page 19: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification - Source Suggestion

Automatic Suggestion of Sources

LEXIS-NEXIS Suggest-a-Source

User Selected Indexing Term: Denver Broncos

LEXIS-NEXIS top Sources for Denver Broncos

Rocky Mountain News
Denver Post
Sports Network
Associated Press
• Seattle Post-intelligencer
USA Today
• Washington Post
• Orlando Sentinel
• Kansas City Star
• Regal-fort Worth Star
• San Diego Union Tribune

LEXIS-NEXIS Suggest-a-Source

User Selected Indexing Term: IPOs

LEXIS-NEXIS top Sources for IPO’s

Cable News Network F

M&A Journal
AFX-Extel News
PR Newswire
Business Wire
Phillips Newsletter
Financial Times Institutional Invest
IAC News
Business Times
Cable News Network
Asia Intelligence Wire
Financial Post
New York Post

• What are these?

Page 20: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification - More Than a Cite List

Source Analyzer

Source Analyzer™ User Selected Sources: Dayton Daily News, Washington Post, LA Times

Download into Excel Spreadsheet

NEXIS Source Analyzer™

Dayton Daily News Topics

2697 Sports •, 2616 Athletes, 2181 Basketball, 1871 Campaigns & Elections, 1772 College Sports, 1503 Cities, 1476 Lawyers, 1473 Baseball & Softball, 1438 High School Sports, 1345 Violent Crime, 1258 Litigation, 1207 Sentencing, 1158 Judges, 1132 American Football, 1086 Fundraising, 937 Television Programming, 931 Deaths & Obituaries, 857 Diseases & Disorders, 852 Settlements & Decisions, 837 Arrests

NEXIS Source Analyzer™

Washington Post Topics

11410 Sports •, 8567 Campaigns & Elections, 7439 Athletes, 6415 Lawyers, 4665 Basketball, 4498 Violent Crime, 4393 Banking & Finance, 4265 Entertainment & Arts, 4155 Baseball & Softball, 3938 Judges, 3753 International Relations, 3703 Budget, 3675 College Sports, 3557 Cities, 3397 Litigation, 3384 Sentencing, 3243 Candidates, 3202 American Football, 3109 Television Programming, 2758 Fundraising

NEXIS Source Analyzer™

Los Angeles Times Topics

6080 Sports •, 3375 Cities, 3101 Campaigns & Elections, 2915 High School Sports, 2815 Athletes, 2800 Lawyers, 2360 Basketball, 2347 Baseball & Softball, 2341 Letters & Comments, 2241 College Sports, 2188 Violent Crime, 2113 San Fernando Valley, 1918 Television Programming, 1851 Litigation, 1793 Judges, 1711 Deaths & Obituaries, 1504 Editorials & Opinions, 1410 Environment, 1391 Television Industry, 1380 Sentencing

• Source Analyzer highlights Common Terms

Page 21: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Consistent Classification - More Than a Cite List

Source Analyzer

Source Analyzer™ User Selected Sources: Financial Times, USA Today

Download into Excel Spreadsheet

NEXIS Source Analyzer™

Financial Times Topics

61039 Banking & Finance, 32061 Mergers & Acquisitions, 18869 Telecommunications, 18112 Trade Agreements, 17499 Campaigns & Elections •, 13484 Currencies, 11458 Computing & Technology, 11121 International Relations, 11056 Exchange Rates, 11009 Privatization, 10229 Emerging Markets, 10160 Energy, 9015 Joint Ventures, 8959 Stock Indexes, 8680 Debt, 8609 Budget, 8606 Automakers, 8424 Engineering, 8347 Central Banks, 8110 Taxes

NEXIS Source Analyzer™

USA Today Topics

30235 Sports, 17591 Athletes, 9006 Baseball & Softball, 9003 College Sports, 8989 Basketball, 8287 Television Programming, 7501 American Football, 7355 Campaigns & Elections •, 6485 Lawyers, 6370 Banking & Finance, 5662 Olympics, 4975 Entertainment & Arts, 4884 Television Industry, 4469 Polls & Surveys, 3975 Litigation, 3832 Airlines, 3363 Judges, 3335 Violent Crime, 3331 International Relations, 2933 Network Television

• Source Analyzer™ highlights Common Terms

• The New Republic, JULY 26, 1999 … The U.S. section is lambasted for repeating what was reported in the American press. To prove it, Sullivan does a Nexis search on the topic of each article in a random issue and compares what he finds to The Economist. The results are not surprising.

Page 22: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Reporter Analysis

What is a reporter covering?

ByLine Analyzer™ User Selected Reporter: Steve Schmidt

Download into Excel Spreadsheet

NEXIS ByLine Analyzer™

Steve Schmidt reported Topics

13 CITIES, 10 NATIONAL PARKS, 10 CAMPAIGNS & ELECTIONS, 8 SUBURBS, 8 MARRIAGE, 7 THEME PARKS, 6 VIOLENT CRIME, 6 SECONDARY SCHOOLS, 5 SPORTS, 5 PUBLIC TRANSPORTATION

NEXIS ByLine Analyzer™

Steve Schmidt reported Companies

5 MICROSOFT CORP, 1 WALT DISNEY CO INC, 1 PACIFIC LUMBER CO, 1 PACIFIC BELL, 1 MAPES HOTEL, 1 DESTINATION PALM BEACH, 1 ALTURAS CASINO, 1 ALASKA AIR GROUP INC

NEXIS ByLine Analyzer™

Steve Schmidt reported people

4 DAVID KNIGHT, 3 SHAWN STINSON, 3 EMILIO ESTEVEZ, 3 CHARLIE SHEEN, 3 BILL GATES, 3 ALBERT GORE JR, 2 WILLIE L BROWN, 2 SCOTT HINSON, 2 PETE KNIGHT, 2 MICHAEL GONZALEZ

NEXIS ByLine Analyzer™

Steve Schmidt reported Organizations

4 SAN DIEGO STATE UNIVERSITY, 4 FEDERAL BUREAU OF INVESTIGATION, 3 SAN DIEGO CITY COUNCIL, 3 NATIONAL PARK SERVICE, 2 WILD HORSE ORGANIZED ASSISTANCE, 2 VALLEY MIDDLE SCHOOL, 2 UNIVERSITY OF CALIFORNIA (LOS ANGELES), 2 SAN DIEGO PADRES, 2 HELIX HIGH SCHOOL, 1 YOSEMITE INSTITUTE

Page 23: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Topic Analysis

Who’s involved and who’s reporting on the recent rash of bacteria-related product recalls?

Topic Analyzer™ User Selected Topics: Product Recalls, Bacteria

Download into Excel Spreadsheet

NEXIS Topic Analyzer™

Top Reporters

2 ROBERT WALKER, 2 NICOLE BAILEY, 2 LYNNE KOZIEY, 1 SHAWN OHLER, 1 SARAH GREEN, 1 QUINTIN ELLISON, 1 MATTHEW P BLANCHARD, 1 MARTHA M. HAMILTON, 1 MARLENE HABIB, 1 MARK BROWN, 1 LYLE HARVEY, 1 KATHERINE HARDING, 1 KAREN CLARK LEPOOLE, 1 JOHN TAYLOR, 1 JESSICA HANSEN, 1 IAN MCDOUGALL, 1 FRED ANKLAM JR, 1 DONNA CASEY, 1 DINA CAPPIELLO, 1 CHU SHOWWEI, 1 CHRISTINE WINTER, 1 BILL EGBERT, 1 BARBARA DURBIN

NEXIS Topic Analyzer™

Top related Companies

29 MOYER PACKING CO, 16 IBP INC, 12 PACKERLAND PACKING CO INC, 11 KRAFT FOODS, 6 LAKESIDE FARM INDUSTRIES, 5 PHILIP MORRIS COS INC, 5 FOOD SAFETY & INSPECTION SERVICE, 4 SNOW BRAND MILK PRODUCTS CO LTD, 3 GARDEN BOTANIKA INC, 2 XL FOODS, 2 STOP & SHOP SUPERMARKET CO, 2 LAKESIDE PACKERS, 2 GIANT FOOD STORES INC, 2 DEL GOULD MEATS INC, 2 COSTCO WHOLESALE CORP

Page 24: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Approaches

Documents

• “ASP” Service Model

Categories

Service Provider

Customer

Internet

Page 25: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Approaches

• Port The Classification Application to run in user’s environment

• Software

• Intellectual Capital

Page 26: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Approaches

• Port the Intellectual Capital to another classification system’s format & logic

Verity Users

Semio Users

Autonomy Users

Hummingbird Users

Inxight Users

Page 27: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Challenges

• Operator Incompatibility

• Parsing vs Inverted Word Index Tools

• Document Length Adjustments

Page 28: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Search Operator Compatibility

• Many Boolean search systems do not have a frequency operator - ATLEASTn( term ) at LexisNexis

• Years ago, LexisNexis noticed that many experienced searchers were simulating a frequency operator by cascading an existing proximity operator

– cat W/9999 cat W/9999 cat

– To simulate ATLEAST3( cat )

• How do we port an ATLEASTn() search to a system without a proximity operator or a system that does not cascade proximity operators?
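The cascading trick can be sketched as a small query builder. This is a minimal sketch, not LexisNexis code: the helper names `cascade_atleast` and `atleast` are hypothetical, the `W/9999` syntax is taken from the slide, and the word splitting is a simplifying assumption.

```python
import re


def cascade_atleast(term: str, n: int, window: int = 9999) -> str:
    """Build 'term W/9999 term ...' with n copies of term (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return f" W/{window} ".join([term] * n)


def atleast(term: str, n: int, text: str) -> bool:
    """Reference meaning assumed for ATLEASTn(term): term occurs n or more times."""
    # Assumption: words are runs of letters and digits, compared case-insensitively.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words.count(term.lower()) >= n


query = cascade_atleast("cat", 3)
print(query)
```

A target system that accepts cascaded proximity operators could take the generated string directly; one that does not needs a different translation, which is the question the slide raises.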

Page 29: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Porting Boolean Searches - Verity Example

ATLEASTn Operator

LNG Boolean: ATLEASTn( expr )

Verity:

<COMPLEMENT>( <YESNO>( <COMPLEMENT>( <AND>( <MULT/[10000/n]>( <FREQ>( expr ) ) ) ) ) )

NOTE:

• ATLEASTn( expr1 or expr2 or … or exprX ) is equivalent to ATLEASTn( expr1 ) or ATLEASTn( expr2 ) or … or ATLEASTn( exprX )

• ATLEASTn( expr1 and expr2 and … and exprX ) is equivalent to ATLEASTn( expr1 ) and ATLEASTn( expr2 ) and … and ATLEASTn( exprX )
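To see why the nested Verity expression behaves like a yes/no frequency test, it can be modeled numerically. This is a sketch under stated assumptions, not Verity's implementation: the inner `<AND>( <MULT/[10000/n]>( <FREQ>( expr ) ) )` is assumed to score count/n capped at 1, `<COMPLEMENT>` is assumed to be 1 - x, and `<YESNO>` is assumed to map any non-zero score to 1. How `[10000/n]` is rounded is not stated on the slide, so exact fractions are used here.

```python
from fractions import Fraction


def capped_freq(count: int, n: int) -> Fraction:
    # Assumed meaning of <AND>( <MULT/[10000/n]>( <FREQ>( expr ) ) ): count/n, capped at 1.
    return min(Fraction(1), Fraction(count, n))


def complement(x: Fraction) -> Fraction:
    # Assumed meaning of <COMPLEMENT>: 1 - x.
    return 1 - x


def yesno(x: Fraction) -> Fraction:
    # Assumed meaning of <YESNO>: 1 for any non-zero score, otherwise 0.
    return Fraction(1) if x > 0 else Fraction(0)


def verity_atleast(count: int, n: int) -> Fraction:
    """Model of <COMPLEMENT>( <YESNO>( <COMPLEMENT>( capped frequency ) ) )."""
    return complement(yesno(complement(capped_freq(count, n))))


for occurrences in range(5):
    print(occurrences, verity_atleast(occurrences, 3))
```

Under these assumptions the inner complement is zero only when the capped frequency reaches 1, so the outer complement returns 1 exactly when the term occurs at least n times.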

Page 30: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Automatic Stemming - Precision Issues

Many search engines perform automatic stemming. Stemming is needed for depluralization, which was assumed when the Search Advisor searches were created and tested. Unfortunately, this “stemming” allows words to match morphological variants other than singular/plural forms. For example, a search on CONSTITUTION may match CONSTITUTIONAL. This causes the ported searches to retrieve documents that the LN Boolean search does not. Some possible solutions:

• Do nothing. The words are often similar in concept. This would require more detailed domain-by-domain analysis.

• Some search tools allow the user to put “quotes” around terms to turn off the stemming. If so, put quotes around all terms and generate additional terms in our search to simulate depluralization.

• Put quotes around all terms and do NOT generate new terms. This omits depluralization as well. I would imagine a huge recall hit.
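The second option (quote every term and generate the plural explicitly) might look like the sketch below. The helper names are hypothetical, the `OR` keyword and the quote-disables-stemming behavior are assumptions about the target engine, and the pluralization rules are deliberately naive: they expect a singular input and ignore irregular nouns.

```python
def plural_variants(term: str) -> list[str]:
    """Return [singular, plural] for a singular English noun (naive rules only)."""
    t = term.lower()
    if t.endswith(("s", "x", "z", "ch", "sh")):
        plural = t + "es"
    elif t.endswith("y") and len(t) > 1 and t[-2] not in "aeiou":
        plural = t[:-1] + "ies"
    else:
        plural = t + "s"
    return [t, plural]


def quoted_or_group(term: str) -> str:
    """Quote each variant (assumed to switch stemming off) and OR them together."""
    return "(" + " OR ".join(f'"{v}"' for v in plural_variants(term)) + ")"


print(quoted_or_group("CONSTITUTION"))
```

With stemming off, CONSTITUTIONAL no longer matches, while the generated plural keeps the depluralization behavior the original searches assumed.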

Page 31: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Porting Boolean Searches - Recall Issues

Proximity operators are impacted by differences in the set of non-searchable “noise” words. Porting LexisNexis searches to a system with fewer noise words will cause some documents matched by LexisNexis’ search engine not to be retrieved.

For example, the search ATTACHED w/5 POLE matches the following text in LN but may not match it in the target system:

“cable attached to the hopper which the gin-pole”.

This also occurs in phrases, which are W/1 searches (really a phrase). We may also miss documents on the term SURETY CONTRACT when LN matched it in the phrase SURETY TO THE CONTRACT.

Possible solution - Increase n by 1 or 2 in the ported search. This could have precision impacts.
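The noise-word effect can be reproduced with a toy proximity matcher. This is only an illustration: the noise list `NOISE` is an assumed sample (the slide does not list the actual noise words), the function `within` is hypothetical, hyphens are assumed to split words, and W/n is modeled as "within n searchable words in either order".

```python
import re

# Assumed sample noise list; the real LexisNexis list is not given on the slide.
NOISE = {"to", "the", "which"}


def within(text: str, first: str, second: str, n: int, noise_words: set[str]) -> bool:
    """True if `first` and `second` occur within n searchable words of each other."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in noise_words]
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(0 < abs(j - i) <= n for i in first_pos for j in second_pos)


text = "cable attached to the hopper which the gin-pole"
print(within(text, "attached", "pole", 5, NOISE))   # noise words skipped
print(within(text, "attached", "pole", 5, set()))   # every word counted
```

When noise words are skipped the two terms are three positions apart; when every word counts they are seven apart, so the same W/5 search stops matching.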

Page 32: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Porting Uncontrolled Classification Tools To Yours

.4 cat

.2 dog

.3 puppy

.4 mouse

Natural Language Search :

cat, dog, puppy, mouse

Natural Language Search :

cat, cat, cat, cat, dog, dog, puppy, puppy, puppy, mouse, mouse, mouse, mouse

New Weighted Natural Language Search that does not use TFIDF:

cat(0.4), dog(0.2), puppy(0.3), mouse(0.4)

•Many companies market uncontrolled classification tools that automatically create categories

• Many cluster terms and assign weights different than TFIDF
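One reading of the slide is that term weights can be approximated in a plain natural language search by repeating each term in proportion to its weight. A minimal sketch of that idea, with a hypothetical helper name and an assumed scale factor of 10:

```python
def weights_to_query(weights: dict[str, float], scale: int = 10) -> str:
    """Repeat each term round(weight * scale) times, keeping the input order."""
    terms: list[str] = []
    for term, weight in weights.items():
        copies = max(1, round(weight * scale))  # keep at least one copy of every term
        terms.extend([term] * copies)
    return ", ".join(terms)


cluster = {"cat": 0.4, "dog": 0.2, "puppy": 0.3, "mouse": 0.4}
print(weights_to_query(cluster))
```

With the weights from the slide this reproduces the repeated-term search shown there. How closely the repetition tracks the original weights depends on how the target engine scores repeated query terms, which is not specified here.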

Page 33: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

LN Topical Indexing to Verity Example

#SUBJECT:
#CVTS:
#SUBJ=CATS & DOGS EXAMPLE
#TERMS:
#WEIGHT=1
#THRESH=5
#FREQLMT=4 {fl01 = 4}
#TERM01=cat
#TERM01=cats
#FREQLMT=4 {fl02 = 4}
#TERM02=dog
#TERM02=dogs

Page 34: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Word Concept Buckets

The #TERM01 word concept counts with a frequency limit of 4 on a scale of 0.0 to 1.0 can be represented in Verity as:

<SUM>( <AND>( <MULT/2500>( <FREQ>( “cat” ) ) ), <AND>( <MULT/2500>( <FREQ>( “cats” ) ) ) )

The #TERM02 word concept counts with a frequency limit of 4 on a scale of 0.0 to 1.0 can be represented in Verity as:

<SUM>( <AND>( <MULT/2500>( <FREQ>( “dog” ) ) ), <AND>( <MULT/2500>( <FREQ>( “dogs” ) ) ) )

Page 35: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Word Concept Buckets

Examples of the TERM01 word concept counts (FL=4)

Expression: <SUM>( <AND>( <MULT/2500>( <FREQ>( “cat” ) ) ), <AND>( <MULT/2500>( <FREQ>( “cats” ) ) ) )

# cat/cats    Score
0             0.00
1             0.25
2             0.50
3             0.75
4             1.00
5+            1.00
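The bucket scores can be reproduced with a few lines of arithmetic. This is a sketch under assumptions inferred from the table, not Verity's implementation: each `<AND>( <MULT/2500>( <FREQ>( term ) ) )` is assumed to contribute 0.25 per occurrence capped at 1, and `<SUM>` is assumed to cap the total at 1. The function name is hypothetical.

```python
def bucket_score(counts: list[int], freq_limit: int) -> float:
    """Score a word concept bucket from the per-variant occurrence counts."""
    # Each variant contributes count / freq_limit, capped at 1.0 (assumed <AND> cap).
    per_term = [min(1.0, count / freq_limit) for count in counts]
    # The combined score is assumed to saturate at 1.0 (assumed <SUM> cap).
    return min(1.0, sum(per_term))


# Occurrences of "cat" only, with a frequency limit of 4.
for total in range(6):
    print(total, bucket_score([total, 0], 4))
```

Splitting the occurrences between the variants, for example three "cat" and two "cats", still saturates at 1.0 under the assumed cap.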

Page 36: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Blocking Effect

#SUBJECT:
#CVTS:
#SUBJ=CAT DOG EXAMPLE
#TERMS:
#THRESH=4
#FREQLMT=5 {fl01 = 5}
#TERM01=cat dog
#FREQLMT=3 {fl02 = 3}
#TERM02=cat
#TERM02=dog
#BLOCK=cat food
#BLOCK=dog food

• In SmartIndexing, we do not count “cat” if it is in the phrase “cat dog”

• This is the Blocking Effect

• This is not natural in inverted word index based search systems

• Very unnatural - “cats and dogs, sleeping together - total hysteria”

Page 37: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Blocking Effect

• Verity has the <FREQ> operator, which counts term frequency without the Blocking Effect.

• So the “cat” in “cat dog” is counted

But …

• <LN-FREQ>( “cat” ) = <FREQ>( “cat” ) - <FREQ>( “cat dog” ) - <FREQ>( “cat food” )

We have term counts with the blocking effect ….

… Whoops! Verity does not have a <SUBTRACT> operator!
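The blocked count defined by the <LN-FREQ> formula can be checked on a sample sentence. This is an illustrative sketch only: the function names are hypothetical, matching is exact on lower-cased words, and each blocking phrase is assumed to contain the term exactly once.

```python
import re


def tokens(text: str) -> list[str]:
    # Assumption: words are runs of letters, compared case-insensitively.
    return re.findall(r"[a-z]+", text.lower())


def phrase_freq(text: str, phrase: str) -> int:
    """Count occurrences of a word or multi-word phrase in the text."""
    words, target = tokens(text), tokens(phrase)
    k = len(target)
    return sum(words[i:i + k] == target for i in range(len(words) - k + 1))


def ln_freq(text: str, term: str, blocking_phrases: list[str]) -> int:
    """Raw frequency of term minus the occurrences inside blocking phrases."""
    blocked = sum(phrase_freq(text, phrase) for phrase in blocking_phrases)
    return phrase_freq(text, term) - blocked


sample = "The cat ate cat food while the cat dog slept near another cat"
print(ln_freq(sample, "cat", ["cat dog", "cat food"]))
```

In the sample, "cat" appears four times, once inside "cat food" and once inside "cat dog", so two occurrences remain after blocking.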

Page 38: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Learning to Subtract

• Introducing <LNG_SUBTRACT>( b , a ) defined as b – a =

<COMPLEMENT>( <SUM>( <COMPLEMENT>( b ) , a ) )

Where 0 <= a <= b <= 1

Follow the math ....

<COMPLEMENT>( <SUM>( <COMPLEMENT>( b ) , a ) )
= <COMPLEMENT>( <COMPLEMENT>( b ) + a )
= <COMPLEMENT>( 1 - b + a )
= 1 - ( 1 - b + a )
= 1 - 1 + b - a
= b – a
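The identity is easy to check numerically. A minimal sketch, assuming `<COMPLEMENT>` means 1 - x and `<SUM>` is plain addition over this range; the Python function names are hypothetical stand-ins for the operators, and exact fractions avoid floating-point noise.

```python
from fractions import Fraction


def complement(x):
    # Assumed meaning of <COMPLEMENT>: 1 - x.
    return 1 - x


def verity_sum(*scores):
    # Assumed meaning of <SUM>: plain addition. No cap is needed here because
    # (1 - b) + a never exceeds 1 when a <= b.
    return sum(scores)


def lng_subtract(b, a):
    """b - a built only from complement and sum, valid for 0 <= a <= b <= 1."""
    if not 0 <= a <= b <= 1:
        raise ValueError("requires 0 <= a <= b <= 1")
    return complement(verity_sum(complement(b), a))


print(lng_subtract(Fraction(3, 4), Fraction(1, 4)))
```

The range condition matters: if a were larger than b, the inner sum would exceed 1, and a scoring system that caps scores at 1 would return 0 instead of a negative difference.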

Page 39: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Actual Results from CATS & DOGS EXAMPLE

Cats & Dogs Test Summary expected results

Score (Doc_)   0 cat/cats     1 cat/cats     2 cat/cats     3 cat/cats     4 cat/cats     5+ cat/cats
0 dog/dogs     0.0 (CD1)      0.125 (CD7)    0.25 (CD11)    0.375 (CD14)   0.50 (CD16)    0.50 (CD17)
1 dog/dogs     0.125 (CD2)    0.25 (CD8)     0.375 (CD12)   0.50 (CD15)    0.625 (CD27)   0.625 (CD32)
2 dog/dogs     0.25 (CD3)     0.375 (CD9)    0.50 (CD13)    0.625 (CD23)   0.750 (CD28)   0.750 (CD33)
3 dog/dogs     0.375 (CD4)    0.50 (CD10)    0.625 (CD20)   0.750 (CD24)   0.875 (CD29)   0.875 (CD34)
4 dog/dogs     0.50 (CD5)     0.625 (CD18)   0.750 (CD21)   0.875 (CD25)   1.00 (CD30)    1.00 (CD35)
5+ dog/dogs    0.50 (CD6)     0.625 (CD19)   0.750 (CD22)   0.875 (CD26)   1.00 (CD31)    1.00 (CD36)

Cats & Dogs Test Actual Results

Score (Doc_)   0 cat/cats     1 cat/cats     2 cat/cats     3 cat/cats     4 cat/cats     5+ cat/cats
0 dog/dogs     0.0000 (CD1)   0.1247 (CD7)   0.2494 (CD11)  0.3746 (CD14)  0.4997 (CD16)  0.5000 (CD17)
1 dog/dogs     0.1247 (CD2)   0.2494 (CD8)   0.3742 (CD12)  0.4993 (CD15)  0.6244 (CD27)  0.6247 (CD32)
2 dog/dogs     0.2494 (CD3)   0.3742 (CD9)   0.4989 (CD13)  0.6240 (CD23)  0.7492 (CD28)  0.7494 (CD33)
3 dog/dogs     0.3746 (CD4)   0.4993 (CD10)  0.6240 (CD20)  0.7492 (CD24)  0.8743 (CD29)  0.8746 (CD34)
4 dog/dogs     0.4997 (CD5)   0.6244 (CD18)  0.7492 (CD21)  0.8743 (CD25)  0.9994 (CD30)  0.9997 (CD35)
5+ dog/dogs    0.5000 (CD6)   0.6247 (CD19)  0.7494 (CD22)  0.8743 (CD26)  0.9997 (CD31)  1.0000 (CD36)

Verity Threshold = THRESH/MAX = 5/8 = 0.625
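The expected-results grid follows from the CATS & DOGS definition (WEIGHT=1, FREQLMT=4 for each of the two word concepts, THRESH=5). A minimal sketch of that arithmetic, with hypothetical function names; treating a score equal to the threshold as a match is an assumption, since the slide does not say whether the comparison is strict.

```python
def expected_score(cats: int, dogs: int, freq_limit: int = 4, weight: int = 1) -> float:
    """Normalized topic score: capped counts over the maximum possible total."""
    max_total = 2 * weight * freq_limit  # two word concepts, so MAX = 8 here
    raw = weight * (min(cats, freq_limit) + min(dogs, freq_limit))
    return raw / max_total


THRESHOLD = 5 / 8  # THRESH / MAX from the slide


def meets_threshold(cats: int, dogs: int) -> bool:
    # Assumption: a score equal to the threshold counts as a match.
    return expected_score(cats, dogs) >= THRESHOLD


for dogs in range(6):
    print([expected_score(cats, dogs) for cats in range(6)])
```

Each printed row corresponds to one dog/dogs row of the expected-results table, with columns for 0 through 5+ cat/cats occurrences.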

Page 40: Portable Classification Tools Mark Shewhart LexisNexis 21 June 2001

Q & A

Mark Shewhart

Consulting Research Scientist

LexisNexis

[email protected]

937-865-6800 x4717