Synaptica ProQuest talk, Taxonomy Boot Camp 2009
DESCRIPTION
PowerPoint presentation given by Dave Clarke, CEO, Synaptica, LLC, and Paula McCoy of ProQuest at Taxonomy Boot Camp 2009 in San Jose, California.
TRANSCRIPT
Taxonomies: Tools or People?
TBC; Taxonomies: Tools or People? By Dave Clarke & Paula McCoy
Copyright © Synaptica, LLC, 2009 www.synapticasoftware.com
04/10/23 Slide 1
When would one favor human indexing over machine indexing? An example of the human indexing effort is presented along with tools that can help with the process. An example of autocategorization is illustrated with a discussion of the reciprocal flow of information between the taxonomy management tool and the autocategorization tool. Speakers then discuss how structured vocabularies help refine categorizers and how feedback from the categorizer tool to the human editorial team contributes to the continual improvement of the vocabularies.
by Dave Clarke & Paula McCoy
HUMAN VS. MACHINE & THE HUMAN OPTION
Dave Clarke, CEO
Synaptica, LLC
[email protected]
Humans will invent almost anything to save time
Human or machine indexing – depends on the data and the user
Humans / Machines
Size
Time-sensitivity
Generalist users
Machine-readability
Conceptual-abstraction
Expert users
Data-structure
Homogeneity
subtle & abstract concepts
non-textual, e.g. images, sounds
highly structured
very high volume
homogeneous topics
mission-critical precision & recall
noisy or incomplete results tolerable
very quick turnaround
Human indexing – the process
Data Set
Controlled Vocabularies
Index Table
1. Review the content
2. Consult the vocabularies
3. Either tag the content item or build an index table
Human indexing – a wish list of time-saving tools
• Minimize switching between screens - integrate vocabulary search / browse with content interface
• Filter specific metadata elements to restrict lookup to relevant vocabularies or subsets of vocabularies
• Search-as-you-type access to controlled vocabularies
• Tree-browse as an alternative to search
• Redirect queries at any time by exploring semantic relationships
• Inline definitional and indexer notes
• Self-correcting substitution of variants with their preferred terms
• Optional pre-population of possible target terms based on text matches
• In-line submission of candidate terms where no appropriate term identified
• Optional automatic expansion of tag-set to include variants, parents, children, associations, language equivalents and crosswalk schema equivalents
• Profile templates to save user- and content-based indexing preferences
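Two of the wish-list items, search-as-you-type lookup and self-correcting substitution of variants, can be illustrated with a minimal Python sketch. The vocabulary, the function names (`suggest`, `normalize`) and the data layout are hypothetical and only meant to show the idea; they are not Synaptica's implementation.

```python
from bisect import bisect_left

# Hypothetical mini-vocabulary: preferred term -> variant (non-preferred) forms.
VOCAB = {
    "Automobiles": ["Cars", "Motor cars"],
    "Automation": [],
    "Aviation": ["Air travel"],
}

def build_lookup(vocab):
    """Map every lower-cased label (preferred or variant) to its preferred term."""
    lookup = {}
    for preferred, variants in vocab.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup

LOOKUP = build_lookup(VOCAB)
SORTED_LABELS = sorted(LOOKUP)

def suggest(prefix, limit=10):
    """Search-as-you-type: preferred terms whose label or variant starts with prefix."""
    prefix = prefix.lower()
    start = bisect_left(SORTED_LABELS, prefix)  # first label >= prefix
    results = []
    for label in SORTED_LABELS[start:]:
        if not label.startswith(prefix):
            break  # labels are sorted, so no later label can match
        preferred = LOOKUP[label]
        if preferred not in results:
            results.append(preferred)
        if len(results) == limit:
            break
    return results

def normalize(entered):
    """Self-correcting substitution: return the preferred term for any known label.

    Returns None for unknown input, which an indexing tool could route to
    in-line candidate-term submission.
    """
    return LOOKUP.get(entered.strip().lower())
```

Typing "ca" would surface "Automobiles" through its variant "Cars", and an unknown entry returns `None` rather than being silently accepted.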
Human indexing – Synaptica’s “IMS” Toolbox
• Minimize switching between screens - integrate vocabulary search / browse with content interface
• Filter specific metadata elements to restrict lookup to relevant vocabularies
• Search-as-you-type access to controlled vocabularies
• Tree-browse and drop-down pick-list alternatives to search
• Redirect queries at any time by exploring semantic relationships
• Inline definitional and indexer notes
• Self-correcting substitution of variants with their preferred terms
• Optional pre-population of possible target terms based on text matches
• In-line submission of candidate terms where no appropriate term identified
• Optional automatic expansion of tag-set to include variants, parents, children, associations, language equivalents and crosswalk schema equivalents
• Profile templates to save user- and content-based indexing preferences
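The optional tag-set expansion in the list above could work roughly as follows. This is a sketch under assumptions: the `Term` record, the relationship names and the sample thesaurus are all hypothetical, and the `include` tuple stands in for the kind of preference a profile template might store.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    """Hypothetical thesaurus record holding a few relationship types."""
    label: str
    variants: list = field(default_factory=list)
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    related: list = field(default_factory=list)

THESAURUS = {
    "Dogs": Term("Dogs", variants=["Canines"], parents=["Mammals"],
                 children=["Terriers"], related=["Pets"]),
    "Mammals": Term("Mammals", children=["Dogs"]),
}

def expand_tags(tags, thesaurus, include=("variants", "parents")):
    """Return the tags plus the labels reached through the selected relationships.

    Order is preserved and duplicates are dropped. Tags that are not in the
    thesaurus are kept unchanged and simply not expanded.
    """
    expanded = []
    for tag in tags:
        candidates = [tag]
        term = thesaurus.get(tag)
        if term is not None:
            for relationship in include:
                candidates.extend(getattr(term, relationship))
        for label in candidates:
            if label not in expanded:
                expanded.append(label)
    return expanded
```

With the defaults, the single tag "Dogs" expands to "Dogs", "Canines" and "Mammals"; passing an empty `include` switches expansion off, which matches the "optional" wording of the wish list.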
Human indexing – IMS Workflow Detail
Human indexing – profile set up screen shot
Human indexing – examples
1. A national library could use IMS to human index digital images and multimedia assets against a set of authority files.
2. A professional services corporation could use IMS to human index mission-critical legal documents against a taxonomy of compliance terminology.
3. A multinational electronics company could use IMS to human index product data according to product lines and families, hardware assets and other product based keyword groups.
Human indexing – conclusions
1. Like everything else in life, if we can possibly pass the task on to machines, we’d like to
2. There are some situations where machines are the only solution and there are others where human indexing is required (non-machine-readable data sets, subtle/abstract concepts, mission-critical precision-recall requirements, etc.)
3. If human indexing is required there are tools that can help speed up the process and help attain indexing consistency
4. The Synaptica “wish list” represents those time-saving tools requested by our user base over the past ten years
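The criteria named in conclusion 2 can be written down as a simple checklist. The sketch below is only an illustration: the `Collection` fields and the `recommend` function are hypothetical names, and a real decision would weigh more factors (volume, turnaround time, tolerance for noisy results) than these three.

```python
from dataclasses import dataclass

@dataclass
class Collection:
    """Hypothetical description of a data set that needs indexing."""
    machine_readable: bool   # False for images, sounds and other non-textual content
    abstract_concepts: bool  # True if indexing depends on subtle or abstract concepts
    mission_critical: bool   # True if precision and recall are mission-critical

def human_indexing_reasons(collection):
    """List the conditions from the conclusions that call for human indexing."""
    reasons = []
    if not collection.machine_readable:
        reasons.append("non-machine-readable data set")
    if collection.abstract_concepts:
        reasons.append("subtle/abstract concepts")
    if collection.mission_critical:
        reasons.append("mission-critical precision and recall")
    return reasons

def recommend(collection):
    """Recommend 'human' if any condition applies, otherwise 'machine'."""
    return "human" if human_indexing_reasons(collection) else "machine"
```

A high-volume newswire feed with none of the three conditions comes out as "machine", while an image archive comes out as "human".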
Copyright © Proquest, Inc., 2009 www.proquest.com
AUTOCATEGORIZATION
A CASE STUDY USING SYNAPTICA
Paula McCoy, Manager, Taxonomy Development
• Information aggregator & database producer, with content ranging from newspapers to academic/scholarly publications, in topics spanning business and management, STM (scientific, technical, medical), humanities, social science, general reference
• Abstracts/indexes more than 6,000 periodicals and newspapers
• Daily ingest of more than 60,000 new newspaper and newswire articles
• Customer base: Public and academic libraries
• End users: Academic and student researchers
The Mandate: To promote discovery of all content relevant to the user’s search query
The Solution: Index and abstract as much content as possible in order to maximize the number of “entry points” to an article.
– Indexing provided for different parts of an article:
• SUBJECTS
• COMPANIES
• PEOPLE
• LOCATIONS
– Abstracts provided for all articles of minimum length
ProQuest Search Interface
A Growing Challenge: How to abstract and index (A&I) hundreds of thousands of new articles every day?
The Only Answer: Autocategorization, or auto-indexing: machine-based application of index terms to a document or other object
ProQuest Search Interface
The Autocategorization Solution
Basic Tenets of Autocategorization:
1. Must have a controlled vocabulary in place
2. Must have other controlled lists if you want to index companies, people, locations, etc.
3. Must have a way to manage your vocabularies
4. Must have a way to manage the results of the autocat: no automated indexing method is perfect
Autocat success rests upon the existence of a strong controlled vocabulary with a history of usage from which the automation software can learn.
The ProQuest Approach
1) Implement Synaptica thesaurus management solution to manage 11,300+-term subject thesaurus and authority files for companies, people, and locations
2) Purchase Nstein Technologies’ Text Mining Engine solution to automate abstracting and indexing of subject and other terms
3) Train the TME to understand the usage of ProQuest thesaurus terms (3-month collaborative process)
4) Implement Nstein’s Knowledge Base Manager (TME Manager) to manage subject terms rules base
Synaptica Taxonomy Manager Nstein
Thesaurus and Autocat Management
Synaptica Thesaurus Management:
• New terms added, hierarchies revised, Scope Notes added/revised
• Use For (non-preferred) terms added frequently to reflect variant usages in the indexed literature and provide additional cross-references
Nstein Autocat Management:
• Nstein TME Manager tool used to manage indexing rules base for all thesaurus terms
• Autocat rules supplement and complement the underlying concept training
• Autocat rules can be added, deleted, revised
• Autocat rules enable autocat indexing to keep up with changes in term usages so that new variants can be added and rules created based on current topics in the literature or in the news
Synaptica-TME Interaction
Thesaurus management informs 2 levels of indexing: manual and automated
• The thesaurus as represented in Synaptica must display all cross-references (mainly Use refs) required by manual indexers
• The thesaurus as represented in Nstein must contain rules reflecting those Use references
Term updates made in Synaptica are duplicated in Nstein via indexing rules
• Use references in Synaptica point human indexers to the right term
• Use references in Nstein rules base point the automated indexer to the right term
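The idea that Use references in the thesaurus are mirrored as indexing rules can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `USE_FOR` mapping stands in for a thesaurus export, and the regex-based rules are a deliberately simple stand-in. This is not the Nstein TME rules format, and it does not reproduce the concept training that the rules supplement.

```python
import re

# Hypothetical thesaurus export: preferred term -> Use For (non-preferred) terms.
USE_FOR = {
    "Mergers & acquisitions": ["Takeovers", "Buyouts"],
    "Unemployment": ["Joblessness"],
}

def build_rules(use_for):
    """Create one whole-phrase, case-insensitive rule per preferred or Use For term."""
    rules = []
    for preferred, variants in use_for.items():
        for phrase in [preferred] + variants:
            pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
            rules.append((pattern, preferred))
    return rules

def index_text(text, rules):
    """Return the sorted preferred terms whose rules match the text."""
    return sorted({preferred for pattern, preferred in rules if pattern.search(text)})

rules = build_rules(USE_FOR)
print(index_text("Analysts expect more takeovers as joblessness falls.", rules))

# A new variant added to the thesaurus only takes effect once the rules are rebuilt,
# which is the duplication step described above.
USE_FOR["Unemployment"].append("Jobless rate")
rules = build_rules(USE_FOR)
print(index_text("The jobless rate rose last month.", rules))
```

Keeping the thesaurus as the single source and regenerating the rules from it is one way to avoid the two representations drifting apart; the talk itself only states that term updates are duplicated via indexing rules.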
Synaptica & Autocat: Benefits
• A semantic-based autocat solution indexes as well as it’s been trained → that training is most successful if based on years of manual indexing using a controlled subject vocabulary → combined with a rules base, autocat can produce intelligent and informed indexing
• Reviewing the results of good autocat leads to comparison with ongoing manual indexing → questions about term usages rise to the surface → human indexing can improve by becoming more flexible and adaptable to changes in terminology → revised term usages are reflected in Synaptica
• Human indexers raise issues of new term variants and need for new terms → Synaptica is updated → the rules base is updated to allow autocat to capture terms better
Benefits for Synaptica Thesaurus Control
• Day-to-day review of automated indexing highlights correct and incorrect term usages, leading to greater discipline in Synaptica thesaurus management to ensure human indexers remain aware of terms and their proper usage.
• The need for precision in subject terms means terms must be exact and descriptive—automated indexing will not work with vague, ambiguous terms or one-word terms with multiple meanings, like “Apologies,” “Affect,” “Articulation.” The result is a more robust and controlled subject vocabulary.
• Automated indexing will use terms in the thesaurus that human indexers may have forgotten about—leading again to revised hierarchies in Synaptica, new Scope Notes, and instant feedback to indexers.
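One small way to support the point about one-word terms is to flag them for editorial review. The sketch below is a heuristic under stated assumptions: it cannot tell whether a term really has multiple meanings, it only surfaces single-word terms that carry no parenthetical qualifier. The qualifier style ("Articulation (Speech)") and the function name are illustrative choices, not taken from the ProQuest thesaurus.

```python
def needs_review(term):
    """Flag single-word terms that have no parenthetical qualifier.

    This is only a review aid: a flagged term may be perfectly unambiguous,
    and an editor makes the final call.
    """
    has_qualifier = "(" in term
    base = term.split("(")[0].strip()
    return len(base.split()) == 1 and not has_qualifier

terms = ["Apologies", "Affect", "Articulation (Speech)", "Corporate governance"]
flagged = [term for term in terms if needs_review(term)]
print(flagged)
```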
[email protected] [email protected]
Questions?