synaptica proquest talk taxonomy boot camp 2009

Upload: synaptica-llc

Post on 23-Jan-2015

DESCRIPTION

PowerPoint presentation given by Dave Clarke, CEO of Synaptica, LLC, and Paula McCoy of ProQuest at Taxonomy Boot Camp 2009 in San Jose, California.

TRANSCRIPT

Page 1: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Taxonomies: Tools or People?

TBC; Taxonomies: Tools or People? By Dave Clarke & Paula McCoy

Copyright © Synaptica, LLC, 2009 www.synapticasoftware.com

04/10/23 Slide 1

When would one favor human indexing over machine indexing? An example of the human indexing effort is presented along with tools that can help with the process. An example of autocategorization is illustrated with a discussion of the reciprocal flow of information between the taxonomy management tool and the autocategorization tool. Speakers then discuss how structured vocabularies help refine categorizers and how feedback from the categorizer tool to the human editorial team contributes to the continual improvement of the vocabularies.

by Dave Clarke & Paula McCoy

Page 2: Synaptica Proquest Talk Taxonomy Boot Camp 2009

HUMAN VS. MACHINE & THE HUMAN OPTION

Dave Clarke, CEO

Synaptica, LLC

Page 3: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Humans will invent almost anything to save time

Page 4: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human or machine indexing – depends on the data and the user

[Diagram: Humans vs. Machines]

Factors: Size, Time-sensitivity, Generalist users, Machine-readability, Conceptual-abstraction, Expert users, Data-structure, Homogeneity

Favoring human indexing:

• subtle & abstract concepts

• non-textual, e.g. images, sounds

• mission-critical precision & recall

Favoring machine indexing:

• highly structured

• very high volume

• homogeneous topics

• noisy or incomplete results tolerable

• very quick turnaround
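
The trade-offs on this slide can be read as a rough checklist. The sketch below is only an illustration of that reading: the `Collection` fields, the `suggest_indexing` function, and the order in which the rules are applied are assumptions made for this example, not something the speakers prescribe.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """Hypothetical description of a content set to be indexed."""
    machine_readable: bool    # False for images, sounds, and similar
    abstract_concepts: bool   # subtle or abstract subject matter
    mission_critical: bool    # precision and recall must be very high
    very_high_volume: bool
    quick_turnaround: bool


def suggest_indexing(c: Collection) -> str:
    """Return 'human', 'machine', or 'either' for a collection."""
    # Assumption: factors that require human judgment take priority.
    if not c.machine_readable or c.abstract_concepts or c.mission_critical:
        return "human"
    # Volume or speed pressure points toward automation.
    if c.very_high_volume or c.quick_turnaround:
        return "machine"
    return "either"


photo_archive = Collection(False, True, False, False, False)
newswire = Collection(True, False, False, True, True)
print(suggest_indexing(photo_archive), suggest_indexing(newswire))
```

In practice the factors would be weighed against each other rather than applied as hard rules; the function only makes the slide's groupings explicit.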

Page 5: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – the process

[Diagram: Data Set, Controlled Vocabularies, Index Table]

1. Review the content

2. Consult the vocabularies

3. Either tag the content item or build an index table
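
As a minimal sketch of step 3, an index table can be modeled as a mapping from each controlled term to the content items tagged with it. The vocabulary contents, the document identifiers, and the `tag_item` helper are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary (step 2 consults this).
VOCABULARY = {"Taxonomies", "Indexing", "Thesauri"}


def tag_item(item_id, chosen_terms, index_table):
    """Record an indexer's chosen terms, accepting only controlled terms."""
    for term in chosen_terms:
        if term not in VOCABULARY:
            raise ValueError(f"{term!r} is not a controlled term")
        index_table[term].add(item_id)


# The index table maps term -> set of content item ids.
index_table = defaultdict(set)
tag_item("doc-1", ["Taxonomies", "Indexing"], index_table)
tag_item("doc-2", ["Indexing"], index_table)
print(sorted(index_table["Indexing"]))
```

Rejecting terms that are not in the vocabulary is one way to enforce consistency; a real tool might instead route such terms into a candidate-term workflow.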

Page 6: Synaptica Proquest Talk Taxonomy Boot Camp 2009

• Minimize switching between screens - integrate vocabulary search / browse with content interface

• Filter specific metadata elements to restrict lookup to relevant vocabularies or subsets of vocabularies

• Search-as-you-type access to controlled vocabularies

• Tree-browse as an alternative to search

• Redirect queries at any time by exploring semantic relationships

• Inline definitional and indexer notes
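
Search-as-you-type lookup, one of the wish-list items above, can be sketched as a prefix search over a sorted term list. This is an illustrative approach using Python's standard `bisect` module, not a description of how any particular product implements it; the term list and the `suggest` function are made up for the example.

```python
import bisect

# Hypothetical vocabulary, sorted case-insensitively for binary search.
TERMS = sorted(["Taxation", "Taxonomies", "Thesauri", "Text mining"], key=str.lower)
KEYS = [t.lower() for t in TERMS]


def suggest(prefix, limit=10):
    """Return up to `limit` terms that start with the typed prefix."""
    p = prefix.lower()
    start = bisect.bisect_left(KEYS, p)  # first key >= prefix
    matches = []
    for i in range(start, len(KEYS)):
        if len(matches) == limit or not KEYS[i].startswith(p):
            break
        matches.append(TERMS[i])
    return matches


print(suggest("tax"))
```

Because the keys are sorted, all terms sharing a prefix are adjacent, so each keystroke costs one binary search plus a short scan.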

Human indexing – a wish list of time-saving tools

Page 7: Synaptica Proquest Talk Taxonomy Boot Camp 2009

• Self-correcting substitution of variants with their preferred terms

• Optional pre-population of possible target terms based on text matches

• In-line submission of candidate terms where no appropriate term identified

• Optional automatic expansion of tag-set to include variants, parents, children, associations, language equivalents and crosswalk schema equivalents

• Profile templates to save user- and content-based indexing preferences
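
Two of the items above, substituting variants with preferred terms and expanding a tag set, can be illustrated with a small sketch. The thesaurus fragment and the `normalize` and `expand` functions are hypothetical; only parent expansion is shown, and the other relationship types listed on the slide would follow the same pattern.

```python
# Hypothetical thesaurus fragment for illustration only.
USE = {"autos": "Automobiles", "cars": "Automobiles"}  # variant -> preferred term
PARENT = {"Automobiles": "Vehicles"}                   # child -> parent term


def normalize(term):
    """Replace a known variant with its preferred term."""
    return USE.get(term.lower(), term)


def expand(tags, include_parents=True):
    """Normalize tags, drop duplicates, and optionally add parent terms."""
    result = []
    for term in map(normalize, tags):
        if term not in result:
            result.append(term)
        parent = PARENT.get(term)
        if include_parents and parent and parent not in result:
            result.append(parent)
    return result


print(expand(["cars", "Autos"]))
```

Keeping expansion optional matches the slide's wording: some indexing profiles may want only the exact preferred term stored with the content.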

Human indexing – a wish list of time-saving tools

Page 8: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – Synaptica’s “IMS” Toolbox

• Minimize switching between screens - integrate vocabulary search / browse with content interface

• Filter specific metadata elements to restrict lookup to relevant vocabularies

• Search-as-you-type access to controlled vocabularies

• Tree-browse and drop-down pick-list alternatives to search

• Redirect queries at any time by exploring semantic relationships

• Inline definitional and indexer notes

• Self-correcting substitution of variants with their preferred terms

• Optional pre-population of possible target terms based on text matches

• In-line submission of candidate terms where no appropriate term identified

• Optional automatic expansion of tag-set to include variants, parents, children, associations, language equivalents and crosswalk schema equivalents

• Profile templates to save user- and content-based indexing preferences

Page 9: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – IMS Workflow Detail

Page 10: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – profile set up screen shot

Page 11: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – examples

1. A national library could use IMS to human index digital images and multimedia assets against a set of authority files.

2. A professional services corporation could use IMS to human index mission-critical legal documents against a taxonomy of compliance terminology.

3. A multinational electronics company could use IMS to human index product data according to product lines and families, hardware assets and other product based keyword groups.

Page 12: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – conclusions

1. Like everything else in life, if we can possibly pass the task on to machines, we’d like to

2. There are some situations where machines are the only solution and there are others where human indexing is required (non-machine-readable data sets, subtle/abstract concepts, mission-critical precision-recall requirements, etc.)

3. If human indexing is required there are tools that can help speed up the process and help attain indexing consistency

4. The Synaptica “wish list” represents those time-saving tools requested by our user base over the past ten years

Page 13: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Copyright © Proquest, Inc., 2009 www.proquest.com

AUTOCATEGORIZATION: A CASE STUDY USING SYNAPTICA

Paula McCoy, Manager, Taxonomy Development

Page 14: Synaptica Proquest Talk Taxonomy Boot Camp 2009

• Information aggregator & database producer, with content ranging from newspapers to academic/scholarly publications, in topics spanning business and management, STM (scientific, technical, medical), humanities, social science, general reference

• Abstracts/indexes more than 6,000 periodicals and newspapers

• Daily ingest of more than 60,000 new newspaper and newswire articles

• Customer base: Public and academic libraries

• End users: Academic and student researchers

Page 15: Synaptica Proquest Talk Taxonomy Boot Camp 2009

The Mandate: To promote discovery of all content relevant to the user’s search query

The Solution: Index and abstract as much content as possible in order to maximize the number of “entry points” to an article.

– Indexing provided for different parts of an article:

• SUBJECTS

• COMPANIES

• PEOPLE

• LOCATIONS

– Abstracts provided for all articles of minimum length

ProQuest Search Interface

Page 16: Synaptica Proquest Talk Taxonomy Boot Camp 2009

A Growing Challenge: How to A&I (abstract and index) hundreds of thousands of new articles every day?

The Only Answer: Autocategorization, or auto-indexing:

Machine-based application of index terms to a document or other object

ProQuest Search Interface

Page 17: Synaptica Proquest Talk Taxonomy Boot Camp 2009

The Autocategorization Solution

Basic Tenets of Autocategorization:

1. Must have a controlled vocabulary in place

2. Must have other controlled lists if you want to index companies, people, locations, etc.

3. Must have a way to manage your vocabularies

4. Must have a way to manage the results of the autocat: no automated indexing method is perfect

Autocat success rests upon the existence of a strong controlled vocabulary with a history of usage from which the automation software can learn.
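
To make the dependence on a controlled vocabulary concrete, here is a deliberately simple rule-based sketch of auto-indexing. It uses plain trigger-phrase matching, which is a stand-in chosen for brevity and not the trained, semantic approach of the engine discussed in this talk; the `RULES` table and the `autocategorize` function are invented for the example.

```python
import re

# Hypothetical rules: controlled subject term -> trigger phrases.
RULES = {
    "Mergers & acquisitions": ["merger", "acquisition", "takeover"],
    "Interest rates": ["interest rate", "rate cut"],
}


def autocategorize(text):
    """Return the controlled terms whose trigger phrases occur in the text."""
    lowered = text.lower()
    assigned = []
    for term, triggers in RULES.items():
        # \b anchors each trigger at the start of a word.
        if any(re.search(r"\b" + re.escape(t), lowered) for t in triggers):
            assigned.append(term)
    return assigned


print(autocategorize("The bank announced a rate cut after the takeover."))
```

Even in this toy form, the output can only be as good as the vocabulary and the rules behind it, which is the point of tenets 1 through 4.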

Page 18: Synaptica Proquest Talk Taxonomy Boot Camp 2009

The ProQuest Approach

1) Implement Synaptica thesaurus management solution to manage 11,300+-term subject thesaurus and authority files for companies, people, and locations

2) Purchase Nstein Technologies’ Text Mining Engine solution to automate abstracting and indexing of subject and other terms

3) Train the TME to understand the usage of ProQuest thesaurus terms (3-month collaborative process)

4) Implement Nstein’s Knowledge Base Manager (TME Manager) to manage subject terms rules base

Synaptica Taxonomy Manager Nstein

Page 19: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Thesaurus and Autocat Management

Synaptica Thesaurus Management:

• New terms added, hierarchies revised, Scope Notes added/revised

• Use For (non-preferred) terms added frequently to reflect variant usages in the indexed literature and provide additional cross-references

Nstein Autocat Management:

• Nstein TME Manager tool used to manage indexing rules base for all thesaurus terms

• Autocat rules supplement and complement the underlying concept training

• Autocat rules can be added, deleted, revised

• Autocat rules enable autocat indexing to keep up with changes in term usages so that new variants can be added and rules created based on current topics in the literature or in the news

Page 20: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Synaptica-TME Interaction

Thesaurus management informs 2 levels of indexing: manual and automated

• The thesaurus as represented in Synaptica must display all cross-references (mainly Use refs) required by manual indexers

• The thesaurus as represented in Nstein must contain rules reflecting those Use references

Term updates made in Synaptica are duplicated in Nstein via indexing rules

• Use references in Synaptica point human indexers to the right term

• Use references in Nstein rules base point the automated indexer to the right term
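
The idea that Use references must be mirrored in the categorizer's rules base can be sketched as follows. The slides do not say whether this duplication is manual or automated, so the automatic derivation here is an assumption, as are the data layout and the `build_rules` function; neither reflects the actual Synaptica or Nstein interfaces.

```python
# Hypothetical thesaurus export: preferred term -> entry with Use For variants.
thesaurus = {
    "Automobiles": {"use_for": ["cars", "autos"]},
    "Physicians": {"use_for": ["doctors"]},
}


def build_rules(thesaurus):
    """Derive lookup rules (lowercased text -> preferred term) from the thesaurus."""
    rules = {}
    for preferred, entry in thesaurus.items():
        rules[preferred.lower()] = preferred
        for variant in entry.get("use_for", []):
            rules[variant.lower()] = preferred
    return rules


rules = build_rules(thesaurus)

# A term update in the thesaurus is carried over by rebuilding the rules.
thesaurus["Automobiles"]["use_for"].append("motorcars")
rules = build_rules(thesaurus)
print(rules["motorcars"])
```

Rebuilding from a single source keeps the manual and automated views of the thesaurus from drifting apart, whichever mechanism is used to do it.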

Page 21: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Synaptica & Autocat: Benefits

• A semantic-based autocat solution indexes as well as it’s been trained; that training is most successful if based on years of manual indexing using a controlled subject vocabulary; combined with a rules base, autocat can produce intelligent and informed indexing

• Reviewing the results of good autocat leads to comparison with ongoing manual indexing; questions about term usages rise to the surface; human indexing can improve by becoming more flexible and adaptable to changes in terminology; revised term usages are reflected in Synaptica

• Human indexers raise issues of new term variants and need for new terms; Synaptica is updated; the rules base is updated to allow autocat to capture terms better
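
One way to picture the review step in this feedback loop is a comparison of manual and automated term assignments for the same documents, counting the terms on which they differ. This is a sketch under assumed data structures (document id mapped to a set of terms); the `disagreements` function and sample data are hypothetical.

```python
from collections import Counter


def disagreements(manual, auto):
    """Count, per term, the documents where only one side assigned it."""
    counts = Counter()
    # Only documents indexed by both humans and the categorizer are compared.
    for doc_id in manual.keys() & auto.keys():
        for term in manual[doc_id] ^ auto[doc_id]:  # symmetric difference
            counts[term] += 1
    return counts


manual = {"a": {"Banking", "Mergers"}, "b": {"Banking"}}
auto = {"a": {"Banking"}, "b": {"Banking", "Interest rates"}}
print(disagreements(manual, auto).most_common())
```

Terms that surface often in such a report are candidates for a new Scope Note, a new variant, or a revised rule.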

Page 22: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Benefits for Synaptica Thesaurus Control

• Day-to-day review of automated indexing highlights correct and incorrect term usages, leading to greater discipline in Synaptica thesaurus management to ensure human indexers remain aware of terms and their proper usage.

• The need for precision in subject terms means terms must be exact and descriptive; automated indexing will not work with vague, ambiguous terms or one-word terms with multiple meanings, like “Apologies,” “Affect,” “Articulation.” The result is a more robust and controlled subject vocabulary.

• Automated indexing will use terms in the thesaurus that human indexers may have forgotten about, leading again to revised hierarchies in Synaptica, new Scope Notes, and instant feedback to indexers.

Page 23: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Questions?