Summarizing Semantic Data
Gong Cheng
National Key Laboratory for Novel Software Technology
Nanjing University, China
Websoft
What is semantic data summarization? Why?
1. Summarizing entity descriptions (a.k.a. entity summarization)
What is semantic data summarization? Why?
2. Summarizing entity associations
[Figure: example graph linking Alice and Bob through article-A, paper-A, paper-B, paper-C, paper-D, AAAI, and IJCAI, with edges labeled firstAuthor, secondAuthor, inProcOf, reviewer, chair, cites, and extends]
Two types of summaries
• Extractive methods
  • summary = a subset of data
  • summarization = ranking and selection
• Abstractive methods (a.k.a. non-extractive methods)
  • summary = a high-level abstraction of data
  • summarization = a more complex process
Outline of this talk
• Summarizing entity descriptions
• Summarizing entity associations
• Summarizing semantic datasets
• Summarizing ontologies (if time permits)
Summarizing entity descriptions
• Extractive methods (summary = a subset of property-value pairs)
  • Metrics for ranking property-value pairs
    • Intrinsic metrics
    • Extrinsic metrics
  • Structures for combining metrics
• Abstractive methods
  • Not known yet
Property      Value
name          Leonardo da Vinci
type          Person
type          Artist
dateOfBirth   1452-04-15
creates       Mona Lisa
creates       Lady with an Ermine
knownFor      Mona Lisa
influenced    Richard Feynman
…
Intrinsic metrics (1): frequency
• Frequency of property
  • Frequency in the dataset
  • Frequency among entities of the same type
  • Frequency in this entity description
  • Frequency in the ontology (i.e., richness of definition)
[Figure: the Leonardo da Vinci description next to two other entity descriptions, one containing "influenced" and one containing "type Artist" and "creates"]
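The frequency counts above can be sketched as follows. This is a minimal illustration, assuming entity descriptions are stored as lists of (property, value) pairs; the second entity and both function names are hypothetical and not taken from the talk.

```python
from collections import Counter

# Toy dataset: entity -> list of (property, value) pairs.
# "Another artist" is a hypothetical entity added only to make the counts non-trivial.
dataset = {
    "Leonardo da Vinci": [
        ("name", "Leonardo da Vinci"),
        ("type", "Person"),
        ("type", "Artist"),
        ("dateOfBirth", "1452-04-15"),
        ("creates", "Mona Lisa"),
        ("creates", "Lady with an Ermine"),
        ("knownFor", "Mona Lisa"),
        ("influenced", "Richard Feynman"),
    ],
    "Another artist": [
        ("type", "Artist"),
        ("creates", "Some painting"),
    ],
}

def property_frequency_in_dataset(data):
    """Count, for each property, how many entities use it at least once."""
    counts = Counter()
    for pairs in data.values():
        counts.update({prop for prop, _ in pairs})
    return counts

def property_frequency_in_description(pairs):
    """Count how many times each property occurs in one entity description."""
    return Counter(prop for prop, _ in pairs)

dataset_freq = property_frequency_in_dataset(dataset)
local_freq = property_frequency_in_description(dataset["Leonardo da Vinci"])
```

Whether a high or a low frequency should raise the rank of a property depends on the application; the sketch only computes the counts.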
Intrinsic metrics (1): frequency
• Frequency of property value
  • Frequency in the dataset (note: entities in text)
  • Frequency in this entity description (note: indirect relations)
[Figure: the Leonardo da Vinci description next to another description with the values "Mona Lisa" and "Lady with an Ermine" and one whose text value mentions "…Mona Lisa…"; indirect relations may also be counted]
Intrinsic metrics (1): frequency
• Frequency of property-value pair
  • Frequency among similar entities
  • Frequency in the dataset (why not?)
[Figure: the Leonardo da Vinci description next to a similar entity whose description also contains "type Artist" and "influenced Richard Feynman"]
Intrinsic metrics (2): centrality
• Centrality of property value
  • Within the dataset: (weighted) PageRank
  • On the Web: authority of datasets referencing it
Intrinsic metrics (2): centrality
• Centrality of property-value pair
  • PageRank, weighted by inverse Google distance [Cheng et al., ISWC’11]
[Figure: property-value pairs such as "name: Leonardo da Vinci", "type: Person", and "creates: Mona Lisa" shown as nodes of a graph]
Intrinsic metrics (3): informativeness
• Informativeness of property-value pair
  • Self-information of property-value pair [Cheng et al., ISWC’11]
  • Depth of class
[Figure: the Leonardo da Vinci description next to another description with "type Person" and "type Scientist"; class hierarchy with Person above Artist and Scientist]
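Self-information can be sketched as the negative logarithm of how often a property-value pair occurs. The probability estimate below (fraction of entity descriptions containing the pair) is an assumption for illustration; the slide does not give the exact estimate used in [Cheng et al., ISWC’11].

```python
import math

def self_information(pair, descriptions):
    """Return -log2(p), where p is the fraction of descriptions containing the pair."""
    containing = sum(1 for pairs in descriptions if pair in pairs)
    if containing == 0:
        raise ValueError("pair does not occur in the dataset")
    return -math.log2(containing / len(descriptions))

# Hypothetical dataset of four entity descriptions (sets of property-value pairs).
descriptions = [
    {("type", "Person"), ("type", "Artist")},
    {("type", "Person"), ("type", "Scientist")},
    {("type", "Person")},
    {("type", "Person"), ("type", "Artist")},
]

info_person = self_information(("type", "Person"), descriptions)
info_scientist = self_information(("type", "Scientist"), descriptions)
```

A pair shared by every entity, such as "type Person" here, carries no information, while a rare pair such as "type Scientist" scores higher.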
Intrinsic metrics (4): diversity
• Diversity of properties
  • To avoid common properties
  • To avoid properties having similar values
Intrinsic metrics (4): diversity
• Diversity of property-value pairs [Cheng et al., JoWS’15, WWW’15]
  • Similarity between text: string-based, word-based
  • Similarity between numbers
  • Semantic similarity: reasoning-based
[Figure: class hierarchy with Person above Artist and Scientist; type:Artist ⇒ type:Person]
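One simple way to use such similarities is a greedy pass over a ranked list that skips pairs too similar to those already selected. The sketch below uses a word-based Jaccard similarity; the 0.3/0.7 weights, the threshold, and the function names are hypothetical choices, not the method of the cited papers.

```python
def word_jaccard(a, b):
    """Word-based similarity between two strings (Jaccard over lowercased tokens)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def pair_similarity(p, q):
    """Weighted mix of property similarity (0.3) and value similarity (0.7); weights are assumptions."""
    return 0.3 * word_jaccard(p[0], q[0]) + 0.7 * word_jaccard(p[1], q[1])

def select_diverse(ranked_pairs, k, threshold=0.5):
    """Walk a ranked list and keep a pair only if it is not too similar to any pair already kept."""
    summary = []
    for pair in ranked_pairs:
        if all(pair_similarity(pair, kept) < threshold for kept in summary):
            summary.append(pair)
        if len(summary) == k:
            break
    return summary

# Hypothetical ranking of property-value pairs by some other metric.
ranked = [
    ("creates", "Mona Lisa"),
    ("knownFor", "Mona Lisa"),
    ("type", "Artist"),
    ("creates", "Lady with an Ermine"),
]
summary = select_diverse(ranked, k=3)
```

Here "knownFor Mona Lisa" is skipped because its value duplicates "creates Mona Lisa".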
Extrinsic metrics (1): using external knowledge
• Using domain knowledge
  • Certain properties are known to be important.
• Using indicators on the Web
  • Search engine hits
  • Bidirectional links in Wikipedia
• Using user feedback
  • User clicks
Extrinsic metrics (2): context-based
• Entity search results
  • context = query
  • solution: query relevance [Cheng et al., IJSWIS’09]
Extrinsic metrics (2): context-based
• Entities in a document
  • context = contents of the document
  • solution: Class Vector Model [Cheng et al., WWW’15]
[Figure: document text "… The Starry Night, from MoMA’s collection, reminds us of some work painted by Leonardo da Vinci. ..." shown with the Leonardo da Vinci description and a description containing "type Painting"; labels: vector(context) = {Painting}, vector = {Painting}, vector = {Artist}]
Extrinsic metrics (2): context-based
• Co-summarization
  • context = other entities
  • solution:
    • difference from other entities [Cheng et al., WWW’15] (for entity linking)
    • similarity with other entities [Cheng et al., JoWS’15] (for entity coreference resolution)
Structures for combining metrics
1. Result combination
[Figure: three rankings of the same five items (5 1 3 2 4; 5 2 4 1 3; 5 1 2 4 3), ranked by Metric A, Metric B, and Metric C, merged into one summary]
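The slide does not say how the three rankings are merged. As one possible illustration, the sketch below applies a Borda count, which is a swapped-in technique and not necessarily what the talk used. The digit sequences from the slide are read here as item IDs in rank order, which is also an assumption.

```python
def borda_combine(rankings):
    """Combine rankings: an item at 0-based position i in a list of n items earns n - i points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - position)
    # Sort by total score (descending), then by item ID for a deterministic order.
    return sorted(scores, key=lambda item: (-scores[item], item))

rankings = [
    [5, 1, 3, 2, 4],  # ranked by Metric A (assumed reading of the slide)
    [5, 2, 4, 1, 3],  # ranked by Metric B
    [5, 1, 2, 4, 3],  # ranked by Metric C
]
combined = borda_combine(rankings)
summary = combined[:3]  # keep the top three items as the summary
```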
Structures for combining metrics
1. Result combination (cont.)
[Figure: items ranked by Metric A, with ties broken by Metric B]
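Tie-breaking amounts to a lexicographic sort. A minimal sketch, with hypothetical item names and scores:

```python
# Hypothetical scores for four property-value pairs under two metrics.
metric_a = {"p1": 0.9, "p2": 0.7, "p3": 0.7, "p4": 0.2}
metric_b = {"p1": 0.1, "p2": 0.3, "p3": 0.8, "p4": 0.5}

# Rank by Metric A (descending); Metric B decides only among equal Metric A scores.
ranked = sorted(metric_a, key=lambda p: (-metric_a[p], -metric_b[p]))
```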
Structures for combining metrics
• e.g., combinatorial optimization
  • Quadratic Knapsack Problem [Cheng et al., JoWS’15]
  • Quadratic Multidimensional Knapsack Problem [Cheng et al., WWW’15]
[Figure: annotated formulations; labels: length constraint; similarity with and difference from other entities; inverse similarity; diagonal: informativeness; one entity; the other entity]
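A quadratic knapsack instance of this kind can be sketched with an exhaustive solver for a handful of items. The reading below (diagonal profit = informativeness, off-diagonal profit = inverse similarity, weight = length) follows the slide labels, but the numbers are hypothetical and the exact formulation in [Cheng et al., JoWS’15] is not reproduced here.

```python
from itertools import combinations

def solve_qkp(profit, weights, capacity):
    """Exhaustively solve a tiny quadratic knapsack instance.

    profit[i][i] is the individual profit of item i; profit[i][j] (i < j) is the
    extra profit earned when items i and j are selected together.
    """
    n = len(weights)
    best_value, best_items = 0, ()
    for size in range(1, n + 1):
        for items in combinations(range(n), size):
            if sum(weights[i] for i in items) > capacity:
                continue  # violates the length constraint
            value = sum(profit[i][i] for i in items)
            value += sum(profit[i][j] for i, j in combinations(items, 2))
            if value > best_value:
                best_value, best_items = value, items
    return best_items, best_value

# Hypothetical instance with four property-value pairs:
# 0 = creates Mona Lisa, 1 = knownFor Mona Lisa, 2 = type Artist, 3 = dateOfBirth.
profit = [
    [5, 1, 4, 3],  # pairs 0 and 1 are similar, so their joint profit is low
    [0, 4, 4, 3],
    [0, 0, 3, 4],
    [0, 0, 0, 2],
]
weights = [2, 2, 1, 1]  # display length of each pair
best_items, best_value = solve_qkp(profit, weights, capacity=4)
```

Exhaustive search is exponential in the number of items, so it only serves to make the objective concrete.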
Structures for combining metrics
• e.g., weighted PageRank [Cheng et al., ISWC’11]
[Figure: property-value pairs such as "name: Leonardo da Vinci", "type: Person", and "creates: Mona Lisa" as graph nodes; labels: probability of jumping, probability of following edges, inverse Google distance, informativeness]
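A weighted PageRank over property-value pairs can be sketched with plain power iteration. The sketch assumes that edges are followed in proportion to a relatedness weight and that jumps land on a pair in proportion to its informativeness, which is how the slide labels appear to pair up; all numbers and names are hypothetical.

```python
def weighted_pagerank(nodes, edge_weight, jump_weight, damping=0.85, iterations=100):
    """Power iteration with weighted edges and a non-uniform jump distribution.

    Assumes every node has at least one outgoing edge with positive weight.
    """
    total_jump = sum(jump_weight[n] for n in nodes)
    jump = {n: jump_weight[n] / total_jump for n in nodes}
    out_total = {n: sum(edge_weight[n][m] for m in nodes if m != n) for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        rank = {
            n: (1 - damping) * jump[n]
            + damping * sum(rank[m] * edge_weight[m][n] / out_total[m]
                            for m in nodes if m != n)
            for n in nodes
        }
    return rank

name, typ, creates = "name: Leonardo da Vinci", "type: Person", "creates: Mona Lisa"
pairs = [name, typ, creates]
# Hypothetical symmetric relatedness weights between pairs.
relatedness = {
    name: {typ: 0.2, creates: 0.6},
    typ: {name: 0.2, creates: 0.3},
    creates: {name: 0.6, typ: 0.3},
}
# Hypothetical informativeness scores used as jump weights.
informativeness = {name: 3.0, typ: 1.0, creates: 2.0}

rank = weighted_pagerank(pairs, relatedness, informativeness)
```

The resulting scores sum to one, and the least informative, least related pair ("type: Person") ends up with the lowest score.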
Structures for combining metrics
4. Complex combinations
• Result combination + arithmetic combination
• Machine learning + arithmetic combination
Summarizing entity associations
• Extractive methods
  • Finding and ranking associations between two entities (summary = a subset of paths)
    • Path finding and filtering
    • Intrinsic and extrinsic metrics for ranking paths
    • Structures for combining metrics
  • Finding and ranking associations between multiple entities (summary = a subset of subgraphs)
• Abstractive methods
  • Ranking association patterns
  • Hierarchically organizing association patterns
Finding associations between two entities
• Path finding
  • Dijkstra or A*
  • Bidirectional breadth-first search (bi-BFS)
  • Schema-based performance optimization
[Figure: the example graph and its schema with classes Paper, Person, and Conference and relations firstAuthor, secondAuthor, inProcOf, cites, extends, reviewer, and chair; search cost O(Δ^d) versus O(Δ^(d/2))]
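Bidirectional BFS searches from both entities at once, so each side only explores to about half the path length (roughly Δ^(d/2) nodes per side instead of Δ^d for a one-sided search with maximum degree Δ and distance d). A minimal sketch on an undirected, unlabeled graph; the edge list is hypothetical and only reuses entity names from the example.

```python
from collections import deque

def bidirectional_bfs(graph, source, target):
    """Return one shortest path between source and target in an undirected graph, or None."""
    if source == target:
        return [source]
    parents_s, parents_t = {source: None}, {target: None}
    frontier_s, frontier_t = deque([source]), deque([target])

    def expand(frontier, parents, other_parents):
        """Expand one full BFS level; return a meeting node if the two searches touch."""
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbor in graph.get(node, ()):
                if neighbor not in parents:
                    parents[neighbor] = node
                    if neighbor in other_parents:
                        return neighbor
                    frontier.append(neighbor)
        return None

    while frontier_s and frontier_t:
        # Grow the smaller frontier to keep both searches shallow.
        if len(frontier_s) <= len(frontier_t):
            meet = expand(frontier_s, parents_s, parents_t)
        else:
            meet = expand(frontier_t, parents_t, parents_s)
        if meet is not None:
            path, node = [], meet
            while node is not None:  # walk back to the source
                path.append(node)
                node = parents_s[node]
            path.reverse()
            node = parents_t[meet]
            while node is not None:  # walk forward to the target
                path.append(node)
                node = parents_t[node]
            return path
    return None

# Hypothetical undirected edges between entities of the example graph.
edges = [
    ("Alice", "paper-A"), ("paper-A", "AAAI"), ("AAAI", "Bob"),
    ("Alice", "paper-B"), ("paper-B", "paper-C"),
    ("paper-C", "paper-D"), ("paper-D", "Bob"),
]
graph = {}
for u, v in edges:
    graph.setdefault(u, []).append(v)
    graph.setdefault(v, []).append(u)

path = bidirectional_bfs(graph, "Alice", "Bob")
```

The sketch returns a single shortest path; association finding as described on the slides would enumerate many paths and keep relation labels.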
Finding associations between two entities
• Path filtering
  • By length
  • By entities, classes, relations
  • By keywords
Ranking associations between two entities
• Intrinsic metrics
  • Frequency
  • Centrality
  • Informativeness
  • Diversity
  • Length
  • Conformity
• Extrinsic metrics
  • Using external knowledge
  • Context-based
• Structures for combining metrics
Intrinsic metrics: frequency, centrality, diversity, length
• Property frequency
• Degree centrality
• Diverse relations
• Length
Intrinsic metrics: informativeness
• Informativeness
  • Data-based informativeness: inverse relation frequency
  • Schema-based informativeness: depth of class/relation
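Inverse relation frequency can be sketched in the spirit of inverse document frequency: the rarer a relation is in the data, the more informative a path using it. The log-ratio formula, the averaging over a path, and the triples below are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical triples reusing the relation names from the example graph.
triples = [
    ("paper-A", "firstAuthor", "Alice"),
    ("paper-B", "firstAuthor", "Alice"),
    ("paper-C", "firstAuthor", "Bob"),
    ("paper-A", "secondAuthor", "Bob"),
    ("paper-A", "inProcOf", "AAAI"),
    ("paper-C", "inProcOf", "IJCAI"),
    ("paper-B", "cites", "paper-C"),
    ("AAAI", "chair", "Bob"),
]
relation_count = Counter(p for _, p, _ in triples)

def relation_informativeness(relation):
    """Inverse relation frequency: log2 of (number of triples / triples using this relation)."""
    return math.log2(len(triples) / relation_count[relation])

def path_informativeness(relations):
    """Average informativeness of the relations along a path."""
    return sum(relation_informativeness(r) for r in relations) / len(relations)

short_path = path_informativeness(["firstAuthor", "secondAuthor"])
long_path = path_informativeness(["firstAuthor", "cites", "firstAuthor"])
```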
Intrinsic metrics: conformity
• Conformity to schema
[Figure: the example graph next to its schema over Paper, Person, and Conference]
Extrinsic metrics
• Using external knowledge
  • Explicit: user-defined weights
  • Implicit: user’s Web browsing history
• Context-based
  • Query relevance
Finding and ranking associations between multiple entities
• association = a size-constrained connected subgraph (size = number of other entities)
[Figure: 3 associations via 2 other entities]
Finding and ranking associations between multiple entities
• association = a size-constrained connected subgraph (size = diameter) [Cheng et al., ISWC’16]
[Figure: 3 associations having a diameter of 3]
Finding and ranking associations between multiple entities
• Subgraph finding
  • n-directional breadth-first search
  • Distance-based performance optimization [Cheng et al., ISWC’16]
Finding and ranking associations between multiple entities
• Subgraph ranking (based on entity ranking)
  • PageRank
  • Query relevance
  • Number of short paths
  • Random walk with restart
Finding and ranking associations between multiple entities
• association = a Steiner tree (size-unconstrained, weight-minimized)
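Abstractive methods
• Association pattern [Cheng et al., ISWC’14]
[Figure: two associations, one over paper-A and conf-A with relations secondAuthor, inProcOf, reviewer, and one over paper-B and conf-B with relations firstAuthor, inProcOf, chair, generalized to a pattern over Paper and Conference with relations author, inProcOf, role]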
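The generalization from associations to a pattern can be sketched by replacing intermediate entities with their classes and relations with their super-properties. The dictionaries, edge directions, and the set-of-triples encoding below are hypothetical simplifications.

```python
# Hypothetical typing and property hierarchy.
entity_class = {
    "paper-A": "Paper", "paper-B": "Paper",
    "conf-A": "Conference", "conf-B": "Conference",
}
super_property = {
    "firstAuthor": "author", "secondAuthor": "author",
    "reviewer": "role", "chair": "role",
}

def to_pattern(association, query_entities):
    """Replace non-query entities by their classes and relations by their super-properties."""
    def lift(node):
        return node if node in query_entities else entity_class[node]
    return frozenset(
        (lift(s), super_property.get(p, p), lift(o)) for s, p, o in association
    )

query = {"Alice", "Bob"}
# Two associations between Alice and Bob; edge directions are assumptions.
association_1 = [
    ("paper-A", "secondAuthor", "Alice"),
    ("paper-A", "inProcOf", "conf-A"),
    ("conf-A", "reviewer", "Bob"),
]
association_2 = [
    ("paper-B", "firstAuthor", "Alice"),
    ("paper-B", "inProcOf", "conf-B"),
    ("conf-B", "chair", "Bob"),
]
pattern_1 = to_pattern(association_1, query)
pattern_2 = to_pattern(association_2, query)
```

A set of lifted triples cannot distinguish two different intermediate entities of the same class, so it is only adequate for small cases like this one.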
Ranking association patterns
• Metrics
  • Frequency
  • Informativeness
  • Diversity
• Structures for combining metrics
[Figure: pattern over Paper and Conference with relations author, inProcOf, and role]
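Metrics: frequency
• frequency = occurrences of canonical code [Cheng et al., ISWC’16]
[Figure: testing whether two patterns are isomorphic by comparing their canonical codes (when T=e); the example code string is garbled in the source: "1r1C1r2C2r3eq2$r4eq3$$$$"]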
Metrics: frequency
• frequency = occurrences of canonical code [Cheng et al., ISWC’16]
• Solution: using query entities as proxies for classes to be ordered
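Hierarchically organizing association patterns
• subClassOf/subPropertyOf → subPatternOf [Zhang et al., JIST’13]
[Figure: pattern over Paper and Conference with relations author, inProcOf, role, shown with two more specific patterns: Demo and Conference with author, inProcOf, reviewer; Poster and Conference with author, inProcOf, chair]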
Summarizing semantic datasets
• Extractive methods (summary = a subset of triples)
  • Centrality
• Abstractive methods
  1. Inferred schema
  2. Flat partitioning
  3. Hierarchical grouping
Extractive methods
• Triple ranking (based on entity ranking)
  • Centrality: degree, PageRank
Abstractive methods (1): inferred schema
• summary = a graph-structured (sub-)schema inferred from data (grouping entities by classes)
[Figure: the example graph next to the inferred schema over Paper, Person, and Conference with relations firstAuthor, secondAuthor, inProcOf, cites, extends, reviewer, and chair]
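Inferring such a schema can be sketched by lifting every instance-level triple to the classes of its subject and object. The triples, their directions, and the typing below are hypothetical; only the class and relation names come from the slide.

```python
# Hypothetical instance data; edge directions are assumptions.
entity_class = {
    "Alice": "Person", "Bob": "Person",
    "paper-A": "Paper", "paper-B": "Paper", "paper-C": "Paper",
    "AAAI": "Conference", "IJCAI": "Conference",
}
triples = [
    ("paper-A", "firstAuthor", "Alice"),
    ("paper-A", "secondAuthor", "Bob"),
    ("paper-A", "inProcOf", "AAAI"),
    ("paper-B", "cites", "paper-A"),
    ("paper-C", "extends", "paper-B"),
    ("paper-C", "inProcOf", "IJCAI"),
    ("AAAI", "chair", "Bob"),
    ("IJCAI", "reviewer", "Alice"),
]

def infer_schema(triples, entity_class):
    """Group entities by class and collect the relations seen between each pair of classes."""
    schema = {}
    for s, p, o in triples:
        schema.setdefault((entity_class[s], entity_class[o]), set()).add(p)
    return schema

schema = infer_schema(triples, entity_class)
```

Counting how many triples support each schema edge, rather than collecting a set, would give the frequency metric for ranking classes and properties.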
Abstractive methods (1): inferred schema
• Metrics for ranking classes and properties
  • Frequency
  • Centrality
Abstractive methods (2): flat partitioning
• summary = entity partitions connected by relations
  • partitioning by shared classes (= inferred schema)
  • partitioning by shared attributes
  • partitioning by shared paths (a.k.a. bisimulation)
Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16]
• summary = a hierarchical grouping of entities
  • identified by property-value pairs
  • connected by relations
• Metrics for choosing groups (i.e., property-value pairs)
  • Coverage of data → large subgroups
  • Height of hierarchy → moderate-sized subgroups
  • Cohesion within groups → informative property-value pairs
  • Overlap between groups → controllable overlap
  • Homogeneity of groups → different values of the same property
[Figure: a hierarchical grouping of entities; relations connecting sibling groups]
Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16]
• Combining metrics by combinatorial optimization (formulated as a multidimensional knapsack problem)
  • maximizing moderateness of each subgroup
  • maximizing cohesion within each subgroup
  • disallowing large overlap between subgroups
  • selecting ≤k subgroups
  • (optionally) disallowing different properties
Concluding remarks
• Research
  • More application scenarios are to be identified.
  • New applications may promote new metrics.
  • More benchmarks are needed for evaluation.
• Practice
  • Handy tools for semantic data summarization are missing.
The 2016 ENtity Summarization Evaluation Campaign (ENSEC 2016)
http://km.aifb.kit.edu/ws/sumpre2016/challenge.html
Papers on summarizing entity descriptions
• Gong Cheng, Danyun Xu, Yuzhong Qu. Summarizing Entity Descriptions for Effective and Efficient Human-centered Entity Linking. (WWW’15)
• Gong Cheng, Danyun Xu, Yuzhong Qu. C3D+P: A Summarization Method for Interactive Entity Resolution. (JoWS’15)
• Gong Cheng, Thanh Tran, Yuzhong Qu. RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization. (ISWC’11)
• Gong Cheng, Yuzhong Qu. Searching Linked Objects with Falcons: Approach, Implementation and Evaluation. (IJSWIS’09)
Papers on summarizing entity associations
• Gong Cheng, Daxin Liu, Yuzhong Qu. Efficient Algorithms for Association Finding and Frequent Association Pattern Mining. (ISWC’16)
• Gong Cheng, Yanan Zhang, Yuzhong Qu. Explass: Exploring Associations between Entities via Top-K Ontological Patterns and Facets. (ISWC’14)
• Yanan Zhang, Gong Cheng, Yuzhong Qu. Towards Exploratory Relationship Search: A Clustering-based Approach. (JIST’13)
Papers on summarizing semantic datasets
• Gong Cheng, Cheng Jin, Yuzhong Qu. HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization. (IJCAI’16)
Ontology
• Terms
  • Publication
  • Paper
  • Conference
  • title
  • inProc
• Term descriptions
  • SubClassOf(Paper, Publication)
  • SubClassOf(Paper, DataExactCardinality(1, title))
  • ObjectPropertyDomain(inProc, Paper)
  • ObjectPropertyRange(inProc, Conference)
Summarizing ontologies
• Extractive methods
  1. Ranking terms (summary = a subset of terms)
  2. Ranking term descriptions (summary = a subset of term descriptions)
  3. Ranking subgraphs (summary = a subgraph)
• Abstractive methods
  • Not known yet
Extractive methods (1): ranking terms
• Intrinsic metrics
  1. Frequency
  2. Centrality
  3. Diversity
  4. Simplicity
• Extrinsic metrics
  1. Using external knowledge
  2. Context-based
Intrinsic metrics (2): centrality
• Middleness in the hierarchy
• Degree
• Betweenness
• PageRank
[Figure: term graph over Paper, Publication, title, inProc, and Conference; class hierarchy with levels Publication; Paper, Book; Article, Poster]
Intrinsic metrics (4): simplicity
• Number of words in the name of a term
  • Paper vs. PaperPublishedAtCCKS2016
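Counting the words in a term name can be sketched with a regular expression that splits CamelCase names. Treating acronyms and digit runs as separate words is an assumption of this sketch.

```python
import re

def count_words(term_name):
    """Count CamelCase words, acronyms, and digit runs in a term name."""
    # Acronym (uppercase run not followed by lowercase), capitalized or
    # lowercase word, or a run of digits.
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", term_name)
    return len(tokens)

simple = count_words("Paper")
complex_name = count_words("PaperPublishedAtCCKS2016")
```

Under this tokenization "PaperPublishedAtCCKS2016" splits into Paper, Published, At, CCKS, and 2016, so the simpler term "Paper" would be preferred.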
Extrinsic metrics
• Using external knowledge
  • Search engine hits
  • Personalization (e.g., spreading activation)
• Context-based
  • Query relevance
[Figure: term graph over Paper, Publication, title, inProc, and Conference]
Extractive methods (2): ranking term descriptions
• Graph representation of term descriptions
  1. Description graph
  2. Term-description graph
• Ranking term descriptions
  • Intrinsic metrics
  • Extrinsic metrics
Graph representation (1): description graph [Zhang et al., WWW’07]
SubClassOf(Paper, Publication)
SubClassOf(Paper, DataExactCardinality(1, title))
ObjectPropertyDomain(inProc, Paper)
ObjectPropertyRange(inProc, Conference)
Graph representation (2): term-description graph [Zhang et al., JCST’09; Cheng et al., JIST’11]
SubClassOf(Paper, Publication)
SubClassOf(Paper, DataExactCardinality(1, title))
ObjectPropertyDomain(inProc, Paper)
ObjectPropertyRange(inProc, Conference)
Paper
Publication
title
inProc
Conference
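A term-description graph can be sketched as a bipartite structure that links each term description to the terms it mentions. The construction below is an illustrative assumption based on the slide's example; it tokenizes the functional-style strings naively instead of parsing OWL.

```python
import re

terms = {"Publication", "Paper", "Conference", "title", "inProc"}
descriptions = [
    "SubClassOf(Paper, Publication)",
    "SubClassOf(Paper, DataExactCardinality(1, title))",
    "ObjectPropertyDomain(inProc, Paper)",
    "ObjectPropertyRange(inProc, Conference)",
]

def build_term_description_graph(descriptions, terms):
    """Return description -> mentioned terms and term -> descriptions mentioning it."""
    desc_to_terms = {}
    term_to_descs = {t: [] for t in terms}
    for d in descriptions:
        # Keep only identifiers that are known terms (skips SubClassOf, etc.).
        mentioned = {tok for tok in re.findall(r"[A-Za-z]\w*", d) if tok in terms}
        desc_to_terms[d] = mentioned
        for t in sorted(mentioned):
            term_to_descs[t].append(d)
    return desc_to_terms, term_to_descs

desc_to_terms, term_to_descs = build_term_description_graph(descriptions, terms)
# Degree of each term node, usable as a simple centrality score.
term_degree = {t: len(ds) for t, ds in term_to_descs.items()}
```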
Ranking term descriptions
• Intrinsic metrics
  • Frequency
  • Centrality
  • Diversity
  • Cohesion/coherence
• Extrinsic metrics
  • Query relevance
Papers on summarizing ontologies
• Weiyi Ge, Gong Cheng, Huiying Li, Yuzhong Qu. Incorporating Compactness to Generate Term-association View Snippets for Ontology Search. (IP&M’13)
• Gong Cheng, Feng Ji, Shengmei Luo, Weiyi Ge, Yuzhong Qu. BipRank: Ranking and Summarizing RDF Vocabulary Descriptions. (JIST’11)
• Xiang Zhang, Gong Cheng, Weiyi Ge, Yuzhong Qu. Summarizing Vocabularies in the Global Semantic Web. (JCST’09)
• Xiang Zhang, Gong Cheng, Yuzhong Qu. Ontology Summarization Based on RDF Sentence Graph. (WWW’07)