Download - Semantic Search Tutorial at SemTech 2012
![Page 1: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/1.jpg)
Peter Mika| Yahoo! Research, Spain
ThanhTran | Institute AIFB, KIT, Germany
Semantic Search Tutorial
Introduction
![Page 2: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/2.jpg)
About the speakers
� Peter Mika
� Senior Research Scientist
� Head of Semantic Search group at Yahoo! Research in Barcelona
� Semantic Search, Web Object Retrieval, Natural Language Processing
� Tran Duc Thanh
� Assistant Professor at AIFB, Karlsruhe Institute ofTechnology
� Head of Semantic Search group
� Semantic Search, Semantic / Linked Data Management
![Page 3: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/3.jpg)
Agenda
� Introduction (10 min)
� Semantic Web data (50 min)
� The RDF data model
� Publishing RDF
� Crawling and indexing RDF data
� Query processing (40 min)
� Ranking (30 min)
� Result presentation (15 min)
� Semantic Search evaluation (15 min)
� Questions (5 min)
![Page 4: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/4.jpg)
Why Semantic Search? I.
� “We are at the beginning of search.“ (Marissa Mayer)
� Solved large classes of queries, e.g. navigational
� Heavy investment in computational power
� Remaining queries are hard, not solvable by brute force, and require a
deep understanding of the world and human cognition
� Background knowledge and metadata can help to address poorly
solved queries
![Page 5: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/5.jpg)
Poorly solved information needs
� Ambiguous searches� paris hilton
� Long tail queries� george bush (and I mean the beer brewer in Arizona)
� Multimedia search� paris hilton sexy
� Imprecise or overly precise searches � jim hendler
� pictures of strong adventures people
� Precise searches for descriptions� countries in africa
� 32 year old computer scientist living in barcelona
� reliable digital camera under 300 dollars
Many of these queries would not be asked by users, who learned over time what search technology can and can not do.
![Page 6: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/6.jpg)
Example: multiple interpretations
![Page 7: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/7.jpg)
Why Semantic Search? II.
� The Semantic Web is now a reality
� Large amounts of data published in RDF
� Heterogeneous data of varying quality
� Users who are not skilled in writing complex queries (e.g. SPARQL)
and may not be experts in the domain
� Searching data instead or in addition to searching documents
� Direct answers
� Novel search tasks
![Page 8: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/8.jpg)
Information box with
content from and links
to Yahoo! Travel
Example: direct answers in search
Points of
interest in
Vienna, Austria
Since Aug, 2010,
‘regular’ search
results are
‘Powered by Bing’
Faceted search
for Shopping
results
Information
from the
Knowledge
Graph
![Page 9: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/9.jpg)
Novel search tasks
� Aggregation of search results
� e.g. price comparison across websites
� Analysis and prediction
� e.g. world temperature by 2020
� Semantic profiling
� recommendations based on particular interests
� Semantic log analysis
� understanding user behavior in terms of objects
� Support for complex tasks
� e.g. booking a vacation using a combination of services
![Page 10: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/10.jpg)
Contextual (pervasive, ambient) search
Yahoo! Connected TV:
Widget engine
embedded into the TV
Yahoo! IntoNow: recognize
audio and show related content
![Page 11: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/11.jpg)
Interactive search and task completion
![Page 12: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/12.jpg)
Document retrieval and data retrieval
� Information Retrieval (IR) support the retrieval of documents (document retrieval)� Representation based on lightweight syntax-centric models
� Work well for topical search
� Not so well for more complex information needs
� Web scale
� Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval)� More expressive models
� Allow for complex queries
� Retrieve concrete answers that precisely match queries
� Not just matching and filtering, but also joins
� Limitations in scalability
![Page 13: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/13.jpg)
Combination of document and data retrieval
� Documents with metadata
� Metadata may be embedded inside the document
� I’m looking for documents that mention countries in Africa.
� Data retrieval
� Structured data, but searchable text fields
� I’m looking for directors, who have directed movies where the synopsis
mentions dinosaurs.
![Page 14: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/14.jpg)
Semantic Search
� Target (combination of) document and data retrieval
� Semantic search is a retrieval paradigm that
� Exploits the structure/semantics of the data or explicit background
knowledge to understand user intent and the meaning of content
� Incorporates the intent of the query and the meaning of content into
the search process (semantic models)
� Wide range of semantic search systems
� Employ different semantic models, possibly at different steps of the
search process and in order to support different tasks
![Page 15: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/15.jpg)
Semantic models
� Semantics is concerned with the meaning of the resources made available for search
� Various representations of meaning
� Linguistic models: models of relationships among words
� Taxonomies, thesauri, dictionaries of entity names
� Inference along linguistic relations, e.g. broader/narrower terms
� Conceptual models: models of relationships among objects
� Ontologies capture entities in the world and their relationships
� Inference along domain-specific relations
� We will focus on conceptual models in this tutorial
� In particular, the RDF/OWL conceptual model for representing classes in a domain, and describing their instances
![Page 16: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/16.jpg)
Semantic Search – a process view
Query Construction
•Keywords
•Forms
•NL
•Formal language
Query Processing
•IR-style matching & ranking
•DB-style precise matching
•KB-style matching & inferences
ResultPresentation
•Query visualization
•Document and data presentation
•Summarization
Query Refinement
•Implicit feedback
•Explicit feedback
•Incentives
Document Representation
Knowledge Representation
Semantic ModelsResources
Documents
![Page 17: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/17.jpg)
Semantic Search systems
For data / document retrieval, semantic search systems might
combine a range of techniques, ranging from statistics-based IR
methods for ranking, database methods for efficient
indexing and query processing, up to complex reasoning
techniques for making inferences!
![Page 18: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/18.jpg)
Example: Information Workbench
� Addressing the lifecycle of interacting with the Web of Data� Integration of data sources� Content generation by the end user� Search and Exploration� Visualization� Publishing
� Integrated management of heterogeneous data sources� Structured and unstructured� Published and user-generated � Static and dynamic� Open domain
User-generated
DynamicStructured
Wikipedia
DBpedia, Yago
Earthquake (Data.gov)
![Page 19: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/19.jpg)
Data Sources in the Application
� Entire English Wikipedia
� Data from Linked Open Data
� DBpedia
� YAGO
� …
� Data from Data.gov (US Government)
� E.g. live data about earthquakes
� Many more
![Page 20: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/20.jpg)
Semantic Search
� Hybrid Search: Structured queries combined with keywords across structured and unstructured data sources
� Query interpretation: Translation of keywords into hybrid queries
� Keyword search/query interpretation combined with faceted search: iterative refinement process based on keywords and operations on facets
![Page 21: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/21.jpg)
Search, Refinement and Navigation
Vorlesung Knowledge Discovery - Institut AIFBFacets
Term
Completions
Keywords
Query
Translations
21
![Page 22: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/22.jpg)
Result Inspection, Analysis and Browsing
![Page 23: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/23.jpg)
Semantic Web data
![Page 24: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/24.jpg)
Data on the Web
� Data on the Web is not directly accessible
� Most web pages are generated from databases, but formatted for
human consumption
� APIs offer limited views over data
� Two solutions
� Extraction using Information Extraction (IE) techniques
� Out of scope for this tutorial
� Relying on publishers to expose structured data using standard
Semantic Web formats
![Page 25: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/25.jpg)
Information extraction
� Natural Language Processing� Named entity recognition and disambiguation, sentiment analysis etc.
� Extraction of information about entities� Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW, 2007.
� Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.
� Extraction from HTML tables� Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008
� Wrapper induction� Kushmerick et al. Wrapper Induction for Information ExtractionText extraction. IJCAI 2007
� Filling web forms automatically (form-filling)� Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
![Page 26: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/26.jpg)
Semantic Web
� Sharing data across the Web
� Standard data model
� RDF
� A number of syntaxes (file formats)
� RDF/XML, RDFa
� Powerful, logic-based languages for schemas
� OWL, RIF
� Query languages and protocols
� HTTP, SPARQL
![Page 27: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/27.jpg)
Resource Description Framework (RDF)
� Each resource (thing, entity) is identified by a URI� Globally unique identifiers
� URLs a subset of URIs
� Often abbreviated using namespaces� e.g. example:roi = http://example.org/roi
� RDF represents knowledge as a set of triples� Each triple is a single fact about the entity (an attribute or a relationship)
� A set of triples forms an RDF graph
example:roi
“Roi Blanco”
name
type foaf:Person
RDF document
![Page 28: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/28.jpg)
Linking resources
example:roi
“Roi Blanco”
namefoaf:Person
sameAs
#roi2worksWith
#peter
type
type
Roi’s homepage
Yahoo!’s website
Friend-of-a-Friend ontology
knows
![Page 29: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/29.jpg)
Ontologies
� Ontologies are the schemas for RDF graphs
� Define the intended meaning of certain classes and relationships in a domain� e.g. the FOAF ontology defines the foaf:Person class and relationships such as foaf:knows
� Ontologies are published in standard languages such as OWL
� Language defines the constructs for modeling and their meaning� e.g. subClassOf, sameAs
� Tools for editing ontologies such as Protégé, TopBraid Composer
� Reusing and extending well-known ontologies helps to interpret unknown data
� e.g. if X is subClassOf foaf:Person, and A is of type X, then we know that A is a person
![Page 30: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/30.jpg)
Example: schema.org
� Agreement on a shared set of schemas for common types of web
content
� Bing, Google, and Yahoo! as initial supporters
� Similar in intent to sitemaps.org (2006)
� Use a single format to communicate the same information to all three search engines
� Support for microdata
� schema.org covers areas of interest to all search engines
� Business listings (local), creative works (video), recipes, reviews
� User defined extensions
� Each search engine continues to develop its products
![Page 31: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/31.jpg)
Documentation and OWL ontology
![Page 32: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/32.jpg)
Publishing RDF
� Interlinked RDF documents (Linked Data)
� Each document describes a single resource with URIs pointing to
related resources
� Common RDF file formats are RDF/XML and Turtle
� Mostly implemented as a wrapper around a database or Web service
� Embedding RDF inside HTML
� RDFa, microdata
� SPARQL endpoints
� Triple stores are databases for managing RDF data
� SPARQL is a standard protocol and query language for accessing
triple stores using HTTP
![Page 33: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/33.jpg)
Linked Data
� Grass-roots community effort to (re)publish open datasets in RDF
� Centered around the Dbpedia project
� Directory of datasets at linkeddata.org, thedatahub.org
![Page 34: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/34.jpg)
Metadata in HTML anno 1995
<HTML><HEAD profile="http://dublincore.org/documents/dcq-
html/"><META name="DC.author " content=" Peter Mika "><LINK rel="DC.rights copyright"
href=" http://www.example.org/rights.html " /> <LINK rel="meta" type="application/rdf+xml" title="FOAF"
href= " http://www.cs.vu.nl/~pmika/foaf.rdf "> </HEAD> …</HTML>
What does this term mean?
How is this data related?
![Page 35: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/35.jpg)
Microformats (anno 2003)
� Mark-up for data in HTML pages
� Reuse existing HTML elements (class, rel)
� Microformats exist for a limited set of objects
� Persons, organizations, reviews, events etc., see microformats.org
� No formal schemas
� Limited reuse, extensibility of schemas
� Unclear which combinations are allowed
� No identifiers for entities
� No interlinking between entities
<div class="vcard"> <a class="email fn" href="mailto:[email protected]">Joe Friday</a> <div class="tel">+1-919-555-7878</div> <div class="title">Area Administrator, Assistant</div> </div>
![Page 36: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/36.jpg)
RDFa and RDFa Lite (2008-2012)
� W3C standards for embedding RDF data in HTML documents
� A set of new HTML attributes to be used in head or body
� A specification of how to extract the data from these attributes
� RDFa is just a syntax, you have to choose a vocabulary separately
� RDFa family
� RDFa 1.0
� Recommendation since October, 2008
� RDFa 1.1 is a small update on RDFa to make it easier to use
� RDFa Primer (Working Draft, May 8, 2012)
� RDFa 1.1 Lite is a subset of RDFa 1.1 with the most commonly used
features
� Proposed Recommendation (May 8, 2012)
![Page 37: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/37.jpg)
Example: Facebook’s Open Graph Protocol
� The ‘Like’ button provides publishers with a way to promote their
content on Facebook and build communities
� Shows up in profiles and news feed
� Site owners can later reach users who have liked an object
� Facebook Graph API allows 3rd party developers to access the data
� Open Graph Protocol is an RDFa-based format that allows to
describe the object that the user ‘Likes’
![Page 38: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/38.jpg)
Example: Facebook’s Open Graph Protocol
� RDF vocabulary to be used in conjunction with RDFa
� Simplify the work of developers by restricting the freedom in RDFa
� Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
� Only HTML <head> accepted
<html xmlns:og="http://opengraphprotocol.org/schema/"> <head>
<title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
</head> ...
![Page 39: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/39.jpg)
Microdata
� HTML5
� Currently under standardization at the W3C
� Originally part of the HTML5 spec, but now a separate document
� Comparable to RDFa 1.1 Lite
� Key extensibility features (such as multiple types) missing
� HTML5 also has a number of “semantic” elements such as <time>,
<video>, <article>…
<div itemscope itemid=“http://www.yahoo.com/resource/person”><p>My name is <span itemprop="name">Neil</span>.</p><p>My band is called <span itemprop="band">Four Parts Water</span>.
I was born on <time itemprop="birthday" datetime="2009-05-10"> May 10th 2009</time>.<img itemprop="image" src=”me.png" alt=”me”></p>
</div>
![Page 40: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/40.jpg)
Current state of metadata on the Web
� 31% of webpages, 5% of domains contain some metadata
� Analysis of the Bing Crawl (US crawl, January, 2012)
� RDFa is most common format
� By URL: 25% RDFa, 7% microdata, 9% microformat
� By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
� Adoption is stronger among large publishers
� Especially for RDFa and microdata
� See also
� P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW
2012
� H.Mühleisen, C.Bizer.Web Data Commons - Extracting Structured
Data from Two Large Web Corpora, LDOW 2012
![Page 41: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/41.jpg)
Exponential growth in RDFa data
Percentage of URLs with embedded metadata in various formats
Five-fold increase between
March, 2009 and October, 2010
Another five-fold increase between
October 2010 and January, 2012
![Page 42: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/42.jpg)
Crawling and indexing
![Page 43: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/43.jpg)
Crawling the Semantic Web
� Linked Data
� Similar to HTML crawling, but the the crawler needs to parse
RDF/XML (and others) to extract URIs to be crawled
� Semantic Sitemap/VOID descriptions
� RDFa
� Same as HTML crawling, but data is extracted after crawling
� Mika et al. Investigating the Semantic Gap through Query Log
Analysis, ISWC 2010.
� SPARQL endpoints
� Endpoints are not linked, need to be discovered by other means
� Semantic Sitemap/VOID descriptions
![Page 44: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/44.jpg)
Data fusion
� Ontology matching� Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org� Unfortunately, not much of it is applicable in a Web context due to the quality of
ontologies
� Entity resolution� Logic-based approaches in the Semantic Web� Studied as record linkage in the database literature
� Machine learning based approaches, focusing on attributes
� Graph-based approaches, see e.g. the work of Lisa Getoor are applicable to RDF data� Improvements over only attribute based matching
� Blending� Merging objects that represent the same real world entity and reconciling information from multiple sources
![Page 45: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/45.jpg)
Data quality assessment and curation
� Heterogeneity, quality of data is an even larger issue
� Quality ranges from well-curated data sets (e.g. Freebase) to microformats
� In the worst of cases, the data becomes a graph of words
� Short amounts of text: prone to mistakes in data entry or extraction
� Example: mistake in a phone number or state code
� Quality assessment and data curation
� Quality varies from data created by experts to user-generated content
� Automated data validation
� Against known-good data or using triangulation
� Validation against the ontology or using probabilistic models
� Data validation by trained professionals or crowdsourcing
� Sampling data for evaluation
� Curation based on user feedback
![Page 46: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/46.jpg)
Indexing
� Search requires matching and ranking
� Matching selects a subset of the elements to be scored
� The goal of indexing is to speed up matching
� Retrieval needs to be performed in milliseconds
� Without an index, retrieval would require streaming through the
collection
� The type of index depends on the query model to support
� DB-style indexing
� IR-style indexing
![Page 47: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/47.jpg)
IR-style indexing
� Index data as text
� Create virtual documents from data
� One virtual document per subgraph, resource or triple
� typically: resource
� Key differences to Text Retrieval
� RDF data is structured
� Minimally, queries on property values are required
![Page 48: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/48.jpg)
Horizontal index structure
� Two fields (indices): one for terms, one for properties
� For each term, store the property on the same position in the property
index
� Positions are required even without phrase queries
� Query engine needs to support the alignment operator
� ✓ Dictionary is number of unique terms + number of properties
� Occurrences is number of tokens * 2
![Page 49: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/49.jpg)
Vertical index structure
� One field (index) per property
� Positions are not required
� But useful for phrase queries
� Query engine needs to support fields
� Dictionary is number of unique terms
� Occurrences is number of tokens
� ✗ Number of fields is a problem for merging, query performance
![Page 50: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/50.jpg)
Distributed indexing
� MapReduce is ideal for building inverted indices
� Map creates (term, {doc1}) pairs
� Reduce collects all docs for the same term: (term, {doc1, doc2…}
� Sub-indices are merged separately
� Term-partitioned indices
� Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.
![Page 51: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/51.jpg)
Query Processing / Matching
![Page 52: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/52.jpg)
Structure
� Taxonomy of search approaches
� Query processing / matching techniques for Semantic Search
� Types of semantic data
� Formalisms for querying semantic data
� Approaches
� General task: hybrid graph pattern matching
� Matching keyword query against text
� Matching structured query against structured data
� Matching keyword query against structured data
� Matching structured query against text (a hybrid case)
� Main tasks, challenges and opportunities
![Page 53: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/53.jpg)
Taxonomy of search approaches (1)
� The search problem
� A collection of resources, called data
� Information needs expressed as queries
� Search is the task of efficiently computing results from data that are relevant to queries
� Document data retrieval vs. structured data retrieval
� Differences in query and data representation and matching
� Efficiently retrieve structured data that exactly match formal information needs expressed as structured queries
� Effectively rank textual results that match ambiguous NL / keyword queries to a certain degree (notions of relevance)
� Semantic search: ranked retrieval of document and structured data (given ambiguous queries / data)
![Page 54: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/54.jpg)
Taxonomy of search approaches (2)
Query
Data
Matching
� Exact
� Complete
� Sound
• Approximate
• Not complete
• Not sound
• Ranked
• Best effort
• Top-k
Query processing mainly focuses on efficiency of matching whereas ranking
deals with degree of matching (relevance)!
Query engines (of databases)
Search engines (stand-alone/database extensons)
![Page 55: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/55.jpg)
Query processing for Semantic Search (1)
� Resources represented by semantic data ranging from� Structured data with well defined schemas
� Semi-structured data with incomplete or no schemas
� Data that largely comprise text
� Hybrid / embedded data
� Information needs of varying complexity, captured using different formalisms and querying paradigms � Natural language texts and keywords
� Form-based inputs
� Formal structured queries
(Search is end-user oriented paradigm, requires “natural”, intuitive querying interfaces)
� Semantic search: efficiently computing results (query processing)from data that are relevant to queries (ranking)
![Page 56: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/56.jpg)
Query processing for Semantic Search (2)
Query
DataMatching
KeywordsNL
QuestionsForm- / facet-based Inputs
Structured Queries (SPARQL)
OWL ontologies with rich, formal semantics
Structured RDF data
Semi-Structured RDF data
RDF data embedded in text
(RDFa)
Ambiquities
Ambiquities: confidence degree, truth/trust value…
![Page 57: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/57.jpg)
Query processing for Semantic Search (3)
Keyword query on textual data, e.g. Web search systems
Keyword query on structured data,
e.g. search extensions for databases
Structured query on textual data , e.g. querying extension for search systems?
Structured query on structured data
e.g. standard querying interface for databases / RDF stores
Semantic Search target different
group of users, information needs,
and types of data. Query processing
for semantic search is hybrid
combination of techniques!
Unstructured
Query
Structured
Query
Textual Data
Structured Data
![Page 58: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/58.jpg)
Types of data models (1)
� Textual
� Bag-of-words
� Represent documents, text in structured data,…, real-world objects (captured as structured data)
� Lacks “structure” � in text, e.g. linguistic structure, hyperlinks, (positional information)
� Structure in structured data representation
In combination with Cloud
Computing technologies,
promising solutions for the
management of `big data'
have emerged. Existing
industry solutions are able to
support complex queries and
analytics tasks with terabytes
of data. For example, using a
Greenplum.
combination
Cloud
Computing
Technologies
solutions
management
`big data'
industry
solutions
support
complex
……
term (statistics)
![Page 59: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/59.jpg)
Types of data models (2)
� Textual
� Structured � Resource Description Framework (RDF)
� Represent real-world objects, services, applications, …. documents
� Resource attribute values and relationships between resources
� Schema
Bob
Personcreator Picture
![Page 60: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/60.jpg)
Types of data models (3)
� Textual
� Structured
� Hybrid
� RDF data embedded in text (RDFa)
![Page 61: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/61.jpg)
Types of data models – RDFa (1)
…<div about="/alice/posts/trouble_with_bob">
<h2 property="dc:title">The trouble with Bob</h2><h3 property="dc:creator">Alice</h3>
Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that
he takes much better photos than I do:
<div about="http://example.com/bob/photos/sunset.jpg"><img src="http://example.com/bob/photos/sunset.jpg" /><span property="dc:title">Beautiful Sunset</span>by <span property="dc:creator">Bob</span>.</div>
</div>…
adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
![Page 62: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/62.jpg)
Types of semantic data – RDFa (2)
Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:
content
content
adopted from : http://www.w3.org/TR/xhtml-rdfa-primer/
![Page 63: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/63.jpg)
Types of semantic data - conclusion
Semantic data in general can be conceived as a graph
with text and structured data items as nodes, and
edges represent different types of relationships
including explicit semantic relationships and
vaguely specified ones such as hyperlinks!
![Page 64: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/64.jpg)
Formalisms for querying semantic data (1)
� Unstructured queries
� Fully-structured queries
� Hybrid queries: unstructured + structured
Example information need
“Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone working at KIT.”
![Page 65: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/65.jpg)
Formalisms for querying semantic data (2)
� Unstructured
� NL
� Keywords apartment Berlin Aliceshared
Example information need
“Information about a friend of Alice, who shared an apartment
with her in Berlin and knows someone working at KIT.”
![Page 66: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/66.jpg)
Formalisms for querying semantic data (3)
� Unstructured
� Fully-structured
� SPARQL: BGP, filter, optional, union, select, construct, ask, describe
� PREFIX ns: <http://example.org/ns#>
SELECT ?x
WHERE { ?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT” }
Example information need
“Information about a friend of Alice, who shared an apartment
with her in Berlin and knows someone working at KIT.”
![Page 67: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/67.jpg)
Formalisms for querying semantic data (4)
� Fully-structured
� Unstructured
� Hybrid: content and structure constraints
?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v.
?v ns:name “KIT”
“shared apartment Berlin Alice”
![Page 68: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/68.jpg)
Formalisms for querying semantic data (5)
� Fully-structured
� Unstructured
� Hybrid: content and structure constraints
?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v.
?v ns:name “KIT”
“shared apartment Berlin Alice”
![Page 69: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/69.jpg)
Formalisms for querying semantic data - conclusion
Semantic search queries can be conceived as graph
patterns with nodes referring to text and
structured data items, and edges referring to
relationships between these items!
![Page 70: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/70.jpg)
Processing hybrid graph patterns (1)
Alice
Bob is a good friend of
mine. We went to the
same university, and
also shared an apartment
in Berlin in 2008. The
trouble with Bob is that
he takes much better
photos than I do:
trouble with bob
Bob
sunset.jpg
Beautiful
Sunset
Thanh
KIT
Germany
Semantic
Search
2009
Germany
PeterFluidOps 34
?y ns:name “Alice”. ?x ns:knows ? y
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
Example information need“Information about a friend of Alice, who shared an apartment with her in Berlin and
knows someone working at KIT.”
![Page 71: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/71.jpg)
Processing hybrid graph patterns (2)
� Matching hybrid graph patterns against data
![Page 72: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/72.jpg)
Matching keyword query against text
• Retrieve documents
• Inverted list (inverted index)
keyword � {<doc1, pos, score, ...>,
<doc2, pos, score, ...>, ...}
• AND-semantics: top-k join
shared
shared berlin alice= =
shared Berlin Alice shared Berlin Alice
D1 D1 D1
![Page 73: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/73.jpg)
Matching structured query against structured data
• Retrieve data for triple patterns
• Index on tables
• Multiple “redundant” indexes to cover different access patterns
• Join (conjunction of triples)
• Blocking, e.g. linear merge join (required sorted input)
• Non-blocking, e.g. symmetric hash-join
• Materialized join indexes
SP-index PO-index
==
=
?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name “KIT”Per1 ns:works ?v ?v ns:name “KIT”
Per1 ns:works Ins1 Ins1 ns:name KIT
Per1 ns:works Ins1 Ins1 ns:name KIT
![Page 74: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/74.jpg)
Matching keyword query against structured data
• Retrieve keyword elements
• Using inverted index
keyword � {<el1, score, ...>, <el2, score, ...>,…}
• Exploration / “Join”
• Data indexes for triple lookup
• Materialized index (paths up to graphs)
• Top-k Steiner tree search, top-k subgraph exploration
↔ ↔
==
Alice Bob KITAlice Bob KIT
Alice ns:knows Bob
Bob ns:works Inst1
Inst1 ns:name KIT
![Page 75: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/75.jpg)
Matching structured query against text
• Based on offline IE (offline see Peter’s slides)
• Based on online IE, i.e., “retrieve “ is as follows
• Derive keywords to retrieve relevant documents
• On-the-fly information extraction, i.e., phrase pattern matching “X name Y”
• Retrieve extracted data for structured part
• Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp.
?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name “KIT”
name
knows
KIT
![Page 76: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/76.jpg)
Matching structured query against text
• Index
• Inverted index for document retrieval and pattern matching
• Join index � inverted index for storing materialized joins between keywords
• Neighborhood indexes for phrase patterns
?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name “KIT”
KIT
name
knows
KIT
name
![Page 77: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/77.jpg)
Query processing – main tasks
� Retrieval
� Documents , data elements, triples, paths,
graphs
� Inverted index,…, but also other (B+ tree)
� Index documents, triples, materialized paths
� Join
� Different join implementations, efficiency
depends on availability of indexes
� Non-blocking join good for early result
reporting and for “unpredictable” Linked
Data / data streams scenario
Query
Data
Matching
![Page 78: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/78.jpg)
Query processing – more tasks
� More complex queries: disjunction, aggregation, grouping, analytics…
� Join order optimization
� Approximate � Approximate the search space
� Approximate the results (matching, join)
� Parallelization
� Top-k � Use only some entries in the input streams to produce k results
� Multiple sources� Federation, routing
� On-the-fly mapping, similarity join
� Hybrid� Join text and data
Query
Data
Matching
![Page 79: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/79.jpg)
Query processing on the Web -
research challenges and opportunities
� Large amount of semantic
data
� Data inconsistent,
redundant, and low quality
� Large amount of data
embedded in text
� Large amount of sources
� Large amount of links
between sources
• Optimization, parallelization
• Approximation
• Hybrid querying and data management
• Federation, routing
• Online schema mappings
• Similarity join
![Page 80: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/80.jpg)
Ranking
![Page 81: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/81.jpg)
Structure
� Problem definition
� Types of ambiguities
� Ranking paradigms
� Model construction
� Content-based
� Structure-based
![Page 82: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/82.jpg)
Ranking – problem definition
Query
Data
Matching
• Ambiguities arise when representation is incomplete / imprecise
• Ambiguities at the level of
• elements (content ambiguity)
• structure between elements (structure ambiguity)
Due to ambiguities in the representation of the information needs and
the underlying resources, the results cannot be guaranteed to exactly
match the query. Ranking is the problem of determining the degree
of matching using some notions of relevance.
![Page 83: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/83.jpg)
Content ambiguity
Alice
Bob is a good friend of
mine. We went to the
same university, and
also shared an apartment
in Berlin in 2008. The
trouble with Bob is that
he takes much better
photos than I do:
trouble with bob
Bob
sunset.jpg
Beautiful
Sunset
Thanh
KIT
Germany
Semantic
Search
2009
Germany
PeterFluidOps 34
?y ns:name “Alice”. ?x ns:knows ? y
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
What is meant by “Berlin” in the query?
What is meant by “Berlin” in the data?
A city with the name Berlin? a person?
What is meant by “KIT” in the query?
What is meant by “KIT” in the data?
A research group? a university? a location?
![Page 84: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/84.jpg)
Structure ambiguity
Alice
Bob is a good friend of
mine. We went to the
same university, and
also shared an apartment
in Berlin in 2008. The
trouble with Bob is that
he takes much better
photos than I do:
trouble with bob
Bob
sunset.jpg
Beautiful
Sunset
Thanh
KIT
Germany
Semantic
Search
2009
Germany
PeterFluidOps 34
?y ns:name “Alice”. ?x ns:knows ? y
apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”
What is the connection between “Berlin” and “Alice”? Friend? Co-worker?
What is meant by “works”? Works at? employed?
![Page 85: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/85.jpg)
Ambiguity
� Recall: query processing is matching at the level of syntax and
semantics
� Ambiguities arise when data or query allow for multiple
interpretations, i.e. multiple matches
� Syntactic, e.g. works vs. works at
� Semantic, e.g. works vs. employ
� “Aboutness”, i.e., contain some elements which represent the
correct interpretation
� Ambiguities arise when matching elements of different granularities
� Does i contains the interpretation for j, given some part(s) of i(syntactically/semantically) match j
� E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…”
� Strictly speaking, ranking is performed after syntactic / semantic matching is done!
![Page 86: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/86.jpg)
Features: What to use to deal with ambiguities?
What is meant by “Berlin”? What is the connection between
“Berlin” and “Alice”?
� Content features
� Frequencies of terms: d more likely to be “about” a query term k
when d more often, mentions k (probabilistic IR)
� Co-occurrences: terms K that often co-occur form a contextual
interpretation, i.e., topics (cluster hypothesis)
� Structure features
� Consider relevance at level of fields
� Linked-based popularity
![Page 87: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/87.jpg)
Ranking paradigms
� Explicit relevance model
� Foundation: probability ranking principle
� Ranking results by the posterior probability (odds) of being
observed in the relevant class:
� P(w|R) varies in different approaches, e.g., binary independence
model, 2-poisson model, relevance model
)|()|()(),...,|( )|(1
1 mqPmwPmPqqwPRwP k
k
iMmk ∑∑
=∈
=≈
))|(1()|( )|( ∏∏∉∈
−=DwDw
NwPRwPRDPP(D|R)
P(D/N)
![Page 88: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/88.jpg)
Ranking paradigms
� No explicit notion of relevance: similarity between the query and
the document model
� Vector space model (cosine similarity)
� Language models (KL divergence)
)),...,(,),...,((),( ,,1,,1 qkqdtd wwwwCosdqSim =
)|(
)|(log()|()||(),(
d
Vtdq tP
tPtPKLdqSim
θθ
θθθ ∑∈
−=−=
![Page 89: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/89.jpg)
Model construction
� How to obtain
� Relevance models?
� Weights for query / document terms?
� Language models for document / queries?
![Page 90: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/90.jpg)
Content-based model construction
� Document statistics, e.g.
� Term frequency
� Document length
� Collection statistics, e.g.
� Inverse document frequency
� Background language models
)|()1(||
)|( CtPd
tftP d λλθ −+=
idfd
tfw dt ∗=
||,
•An object is more likely about “Berlin”?
•When it contains a
relatively high number of
mentions of the term
“Berlin”
• When the number of
mentions of this term in the
overall collection is relatively
low
![Page 91: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/91.jpg)
Structure-based model construction
� Consider structure of objects during content-based modeling,
i.e., to obtain structured content-based model
� Content-based model for structured objects, documents and for
general tuples
)|()|( fFf
fd
d
tPtP θαθ ∑∈
=
•An object is more likely about “Berlin”? •When one of its (important) fields contains a relatively high
number of mentions of the term “Berlin”
![Page 92: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/92.jpg)
Structure-based model construction
� PageRank
� Link analysis algorithm
� Measuring relative importance of nodes
� Link counts as a vote of support
� The PageRank of a node recursively depends on the number and
PageRank of all nodes that link to it (incoming links)
� ObjectRank
� Types and semantics of links vary in structured data setting
� Authority transfer schema graph specifies connection strengths
� Recursively compute authority transfer data graph
•An object about “Berlin” is more important than one another? •When a relatively large number of objects are linked to it
![Page 93: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/93.jpg)
Taxonomy of ranking approaches
� Explicitly vs. non-explicitly relevance-based
� Content-based ranking
� Structure-based ranking
� Content- and-structure-based ranking
![Page 94: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/94.jpg)
Result Presentation
![Page 95: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/95.jpg)
Search interface
� Input and output functionality� helping the user to formulate complex queries� presenting the results in an intelligent manner
� Semantic Search brings improvements in� Query formulation� Snippet generation� Suggesting related entities� Adaptive and interactive presentation
� Presentation adapts to the kind of query and results presented
� Object results can be actionable, e.g. buy this product
� Aggregated search� Grouping similar items, summarizing results in various ways
� Filtering (facets), possibly across different dimensions
� Task completion� Help the user to fulfill the task by placing the query in a task context
![Page 96: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/96.jpg)
Query formulation
� “Snap-to-grid”: suggest the most likely interpretation of the query
� Given the ontology or a summary of the data
� While the user is typing or after issuing the query
� Example: Freebase suggest, TrueKnowledge
![Page 97: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/97.jpg)
Enhanced results/Rich Snippets
� Use mark-up from the webpage to generate search snippets
� Originally invented at Yahoo! (SearchMonkey)
� Google, Yahoo!, Bing, Yandex now consume schema.org markup
� Validators available from Google and Bing
![Page 98: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/98.jpg)
Other result presentation tasks
� Select the most relevant resources within an RDF document
� Penin et al. Snippet Generation for Semantic Web Search Engines,
ASWC 2010
� For each resource, rank the properties to be displayed
� Natural Language Generation (NLG)
� Verbalize, explain results
![Page 99: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/99.jpg)
Aggregated search: facets
![Page 100: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/100.jpg)
Aggregated search: Sig.ma
![Page 101: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/101.jpg)
Related entities
Related actors and movies
![Page 102: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/102.jpg)
Adaptive presentation:
housing search
![Page 103: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/103.jpg)
Evaluation
![Page 104: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/104.jpg)
Semantic Search challenge (2010/2011)
� Two tasks
� Entity Search
� Queries where the user is looking for a single real world object
� Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.
� List search (new in 2011)
� Queries where the user is looking for a class of objects
� Billion Triples Challenge 2009 dataset
� Evaluated using Amazon’s Mechanical Turk
� Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
� Blanco et al. Repeatable and Reliable Search System Evaluation using
Crowd-Sourcing, SIGIR2011
![Page 105: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/105.jpg)
Evaluation form
![Page 106: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/106.jpg)
Catching the bad guys
� Payment can be rejected for workers who try to game the system
� An explanation is commonly expected, though cheaters rarely complain
� We opted to mix control questions into the real results
� Gold-win cases that are known to be perfect
� Gold-loose cases that are known to be bad
� Metrics
� Avg. and std. dev on gold-win and gold-loose results
� Time to complete
Worker Known bad
Real Known Good Total N
Time to complete (sec)N Mean N Mean N Mean
badguy 20 2.556 200 2.738 20 2.684 240 29.6
goodguy 13 1 130 2.038 13 3 156 95
whoknows 1 1 21 1.571 2 3 24 83.5
![Page 107: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/107.jpg)
Other evaluations
� TREC Entity Track
� Related Entity Finding� Entities related to a given entity through a particular relationship
� Retrieval over documents (ClueWeb 09 collection)
� Example: (Homepages of) airlines that fly Boeing 747
� Entity List Completion� Given some elements of a list of entities, complete the list
� Question Answering over Linked Data
� Retrieval over specific datasets (Dbpedia and MusicBrainz)
� Full natural language questions of different forms
� Correct results defined by an equivalent SPARQL query
� Example: Give me all actors starring in Batman Begins.
![Page 108: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/108.jpg)
Resources
![Page 109: Semantic Search Tutorial at SemTech 2012](https://reader034.vdocument.in/reader034/viewer/2022042623/54c79b8d4a7959820a8b45ee/html5/thumbnails/109.jpg)
Resources
� Books
� Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information
Retrieval. ACM Press. 2011
� Survey papers
� ThanhTran, Peter Mika. Survey of Semantic Search Approaches.
Under submission, 2012.
� Conferences and workshops
� ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
� Semantic Search workshop series
� Exploiting Semantic Annotations in Information Retrieval (ESAIR)
� Entity-oriented Search (EOS) workshop