Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach (SemTech 2006)
Bridging the Gap Between Folksonomies and Taxonomies: A Semantic Web Approach
Presentation to STC 2006
Brad Allen, Founder and CTO
Siderean Software, Inc.
Copyright © 2006 Siderean Software, Inc. All rights reserved. 2
Preface
• This is not rocket science
• This is appropriate semantic technology
• What Jim Hendler said: it’s about linking things so the whole is greater than the sum of the parts
Disclaimer
• We will be viewing uncontrolled vocabulary from the Web live
• Sometimes it’s not pretty
• Please don’t be offended
The problem
• Associating subject metadata with content and data is an old technique for improving precision and recall in search
• Traditionally, subject languages have been expressed as highly governed taxonomies (e.g., thesauri and controlled vocabularies) that entail substantial costs in creation and use
• User tagging and the emergence of folksonomies have changed the economics of subject metadata creation, but at the cost of quality
• Can the two approaches to subject metadata be combined to yield the advantages of both while addressing their shortcomings?
Taxonomies
• A taxonomy is a controlled subject language whose terms exist in explicit relation to one another
• Advantages
  • Authoritative reference for terms and their relational semantics
  • Can support reasoning and classification
• Disadvantages
  • Creation requires training and discipline
  • Expensive and slow to track changes in usage
• Adoption
  • Pervasive for decades throughout the information science and IT communities
Folksonomies
• A folksonomy is an uncontrolled subject language whose tags have no explicit relation to one another
• Advantages
  • The cost of creation can be shared across many untrained users
  • Can track changes in usage in real time
• Disadvantages
  • Lexical variations (misspellings, inconsistent case or white space)
  • Lack of relational semantics
  • Sense ambiguity
• Adoption
  • Rapid growth on the Web (del.icio.us, Flickr) and emerging in enterprise pilots (IBM, DKW)
It’s an old story: neats vs. scruffies
• The taxonomy/thesaurus tradition is solid
• But user-generated metadata is gold
• A good solution should leverage aspects of both approaches
Bridging the gap
• The key ideas
  • User tagging gets tags into repository as “author keywords”
    • Ingested through RSS feeds with tagged items
  • Tags are related to terms in (separately defined) taxonomies
  • Users can search using one or the other or both
• Result
  • Folksonomies make taxonomies more responsive
  • Taxonomies make folksonomies more responsible
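The key ideas above (tags ingested with items, tags related to taxonomy terms, search by either or both) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the crosswalk table, the item ids, and the function names are hypothetical and do not come from the presentation or from any Siderean product.

```python
from collections import defaultdict

# Hypothetical crosswalk from free-form tags to taxonomy terms.
TAG_TO_TERM = {
    "sf": "San Francisco",
    "sanfrancisco": "San Francisco",
    "semweb": "Semantic Web",
}

# Hypothetical ingested items, each with its author-supplied tags.
ITEMS = {
    "item-1": {"sf", "food"},
    "item-2": {"sanfrancisco", "semweb"},
    "item-3": {"semweb"},
}


def build_indexes(items, tag_to_term):
    """Return (tag index, term index); each maps a label to a set of item ids."""
    by_tag, by_term = defaultdict(set), defaultdict(set)
    for item_id, tags in items.items():
        for tag in tags:
            by_tag[tag].add(item_id)
            # Tags with no crosswalk entry stay searchable as tags only.
            if tag in tag_to_term:
                by_term[tag_to_term[tag]].add(item_id)
    return by_tag, by_term


def search(by_tag, by_term, tag=None, term=None):
    """Search by tag, by term, or by both (intersection of the two)."""
    results = None
    if tag is not None:
        results = set(by_tag.get(tag, set()))
    if term is not None:
        hits = set(by_term.get(term, set()))
        results = hits if results is None else results & hits
    return results or set()


by_tag, by_term = build_indexes(ITEMS, TAG_TO_TERM)
print(sorted(search(by_tag, by_term, term="San Francisco")))
print(sorted(search(by_tag, by_term, tag="semweb", term="San Francisco")))
```

Searching on the term reaches items tagged with either spelling variant, while the raw tags remain available for users who prefer them.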
Example from DCMI Conference Thesaurus
Building the bridge with ontologies
• SKOS
  • Lexical vs. concept-based thesauri
  • Modeling taxonomies in SKOS
    • skos:Concept
    • skos:broader/skos:narrower
    • skos:related
• Dublin Core (DC)
  • Basic asset metadata for modeling content creation
    • dc:creator
    • dc:dateSubmitted
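As a rough illustration of the taxonomy side of this model, the sketch below represents concepts with broader and related links in plain Python and derives the narrower direction and the transitive chain of broader concepts. The class, the helper functions, and the sample terms are assumptions made for the example; a real deployment would presumably hold these as RDF statements rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A minimal stand-in for a skos:Concept (illustrative only)."""
    pref_label: str
    broader: set = field(default_factory=set)   # labels of broader concepts
    related: set = field(default_factory=set)   # labels of related concepts


def narrower_of(scheme, label):
    """Derive the narrower concepts as the inverse of the broader links."""
    return {c.pref_label for c in scheme.values() if label in c.broader}


def ancestors(scheme, label):
    """Follow broader links transitively, collecting every ancestor label."""
    seen = set()
    stack = list(scheme[label].broader)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(scheme[current].broader)
    return seen


# Hypothetical three-level scheme.
scheme = {
    "Metadata": Concept("Metadata"),
    "Subject metadata": Concept("Subject metadata", broader={"Metadata"}),
    "Thesauri": Concept("Thesauri", broader={"Subject metadata"},
                        related={"Taxonomies"}),
    "Taxonomies": Concept("Taxonomies", broader={"Subject metadata"},
                          related={"Thesauri"}),
}

print(sorted(ancestors(scheme, "Thesauri")))
print(sorted(narrower_of(scheme, "Subject metadata")))
```

The transitive walk is what lets a search on a broad term also pick up items indexed with its narrower terms.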
Modeling folksonomies in SKOS and DC
• Represent each tag as a skos:Concept
• The prefLabel of the concept is the tag
• The item is skos:subjectOf the concept
• The concept is skos:inScheme associated with the RSS channel
• No broader/narrower/related relationships (at least initially)
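The modeling rules on this slide can be sketched as a function that turns one tagged RSS item into subject/predicate/object triples. The predicate names are copied from the slide as plain strings and are not checked against the SKOS or Dublin Core specifications; the concept URI pattern, the function name, and the sample URLs are assumptions made for illustration.

```python
def item_to_triples(channel_url, item_url, tags, date_submitted):
    """Return (subject, predicate, object) triples for one tagged RSS item."""
    triples = [(item_url, "dc:dateSubmitted", date_submitted)]
    for tag in tags:
        # Hypothetical URI pattern: one concept per tag per channel.
        concept = f"{channel_url}#tag-{tag}"
        triples += [
            (concept, "rdf:type", "skos:Concept"),
            (concept, "skos:prefLabel", tag),        # the tag is the prefLabel
            (concept, "skos:inScheme", channel_url),  # scheme tied to the channel
            (item_url, "skos:subjectOf", concept),    # wording as on the slide
        ]
    # No broader/narrower/related triples are emitted at this stage.
    return triples


triples = item_to_triples(
    "http://example.org/feed",
    "http://example.org/post/1",
    ["semweb", "rdf"],
    "2006-03-07",
)
for triple in triples:
    print(triple)
```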
Addressing the shortcomings
• Reduce/eliminate lexical variation
  • Merge variants into a single concept using skos:prefLabel and skos:altLabel
• Relate tags to terms and other tags
  • Tag the tags with categories
• Place tags in time and space
  • The dc:dateSubmitted of the item is associated with its tags
  • Geolocation metadata can be added to concepts representing physical locations
  • Tags are related to other tags through shared skos:subjectOf relationships with items
• Compensate for ambiguous tags with term indexing
  • Index items tagged with ambiguous tags with unambiguous terms based on context (e.g. the tag “SF”)
  • Allow users to exploit tags and terms concurrently
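The first remedy above, merging lexical variants into one concept with a preferred label and alternate labels, might look like the following sketch. The normalization rule (lowercase, then drop white space, hyphens, and underscores) and the choice of the most frequent variant as the preferred label are assumptions for this example; the slides do not say how variants are detected, and this rule does not catch misspellings.

```python
import re
from collections import Counter, defaultdict


def normalize(tag):
    """Collapse case, white space, hyphens, and underscores (assumed rule)."""
    return re.sub(r"[\s_\-]+", "", tag.strip().lower())


def merge_variants(tags):
    """Group tag variants into concepts with a prefLabel and altLabels."""
    groups = defaultdict(Counter)
    for tag in tags:
        groups[normalize(tag)][tag] += 1
    concepts = {}
    for key, counts in groups.items():
        # Most frequent variant first; ties broken alphabetically.
        ranked = sorted(counts, key=lambda t: (-counts[t], t))
        concepts[key] = {"prefLabel": ranked[0], "altLabels": ranked[1:]}
    return concepts


observed = ["semantic web", "SemanticWeb", "semantic-web", "semantic web", "rdf"]
for key, concept in merge_variants(observed).items():
    print(key, concept)
```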
Social aspects
• The role of the community of interest and focused collections of edge content
• A virtuous circle where navigation and tagging continuously improve quality of subject indexing
• A disruptive impact on the economics of knowledge management
[Diagram: Community of Interest. Content producers and content consumers are connected, through navigation and tagging, to (indexed) content and tagged content.]
Case studies and demonstrations
• Environmental Health News
  • RSS item categorization
• Fac.etio.us
  • RSS/Atom into SKOS/FOAF/DC
• BBC Rushes
  • Crosswalks
Case study: Environmental Health News
• Aggregating content from hundreds of Web pages daily
  • 10^5 Web pages
  • 10^3 originating sites
  • 10^1 editors
  • 10^4 subscribers
• Adding value at the metadata level to the Web at large for a focused community of interest
  • Policy makers
  • Activists
  • Researchers
Case study: fac.etio.us
• Aggregating feeds from del.icio.us social bookmarking site
  • 10^5 Web pages
  • 10^4 tags
  • 10^4 contributors
  • 10^4 originating sites
• Combining user tagging with faceted navigation
• “In 3 clicks, I drilled down through 9700+ sites, to a more specific set of 98 things, down to one I found useful.”
• “… the most comprehensive tool for searching the database of del.icio.us.”
• “Siderean’s half-year test makes the narrowness of the del.icio.us service evident.”
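The click-by-click narrowing described in the first quote is the core loop of faceted navigation: show value counts for each facet of the current result set, let the user pick a value, and repeat on the narrowed set. The sketch below assumes a simple in-memory list of bookmarks; the facet names, sample records, and function names are hypothetical and are not a description of how fac.etio.us was implemented.

```python
from collections import Counter

# Hypothetical bookmarks with three facets: tag, contributor, and site.
BOOKMARKS = [
    {"url": "a", "tag": ["rdf", "semweb"], "contributor": ["ann"], "site": ["w3.org"]},
    {"url": "b", "tag": ["semweb"], "contributor": ["bob"], "site": ["example.org"]},
    {"url": "c", "tag": ["python"], "contributor": ["ann"], "site": ["example.org"]},
]


def facet_counts(items, facet):
    """Count how many items in the current set carry each value of a facet."""
    return Counter(value for item in items for value in item.get(facet, ()))


def drill_down(items, facet, value):
    """Keep only the items that carry the chosen facet value (one click)."""
    return [item for item in items if value in item.get(facet, ())]


step1 = drill_down(BOOKMARKS, "tag", "semweb")     # first click
print(facet_counts(step1, "site"))                 # counts shown for the next click
step2 = drill_down(step1, "site", "w3.org")        # second click
print([item["url"] for item in step2])
```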
Case study: BBC rushes
• Joint work with Accenture Technology Labs for TRECVID program
  • BBC Rushes: 49.3 hours of raw video
  • 4 issues of “Summer Holiday” (~2 hours)
  • BBC One News (30’) + fragment (~3’)
• Faceted navigation using both textual and visual features
Future work
• (Semi)automatic folksonomy/taxonomy crosswalk generation
  • The notion of “relatedness”
    • By co-occurrence
    • By explicit warrant
• Machine learning for tag sense disambiguation
  • Co-training using content that is simultaneously tagged and indexed
• Tag spam filtering
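Relatedness by co-occurrence, listed above as one basis for crosswalk generation, can be approximated by scoring tag pairs on the items they share. The sketch below uses the Jaccard coefficient (shared items divided by all items carrying either tag); that measure is an assumption chosen for illustration, since the slides do not name one, and the sample data is invented.

```python
from itertools import combinations


def cooccurrence_relatedness(tagged_items):
    """Score each co-occurring tag pair by the Jaccard overlap of its item sets."""
    items_by_tag = {}
    for item_id, tags in tagged_items.items():
        for tag in tags:
            items_by_tag.setdefault(tag, set()).add(item_id)
    scores = {}
    for a, b in combinations(sorted(items_by_tag), 2):
        shared = items_by_tag[a] & items_by_tag[b]
        if shared:  # pairs that never co-occur are left out
            union = items_by_tag[a] | items_by_tag[b]
            scores[(a, b)] = len(shared) / len(union)
    return scores


# Hypothetical tagged items.
tagged = {
    "i1": {"rdf", "semweb"},
    "i2": {"rdf", "semweb", "owl"},
    "i3": {"semweb"},
    "i4": {"python"},
}

scores = cooccurrence_relatedness(tagged)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

High-scoring pairs would be candidates for a human or a downstream process to confirm, which is one way to read the "(semi)automatic" qualifier.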
Siderean Software, Inc.
390 North Sepulveda Blvd., Suite 2070
El Segundo, CA 90245-4475 USA
+1 310 647-4266
http://www.siderean.com
ballen at siderean dot com