Summarizing Semantic Data Gong Cheng National Key Laboratory for Novel Software Technology Nanjing University, China Websoft

Upload: gong-cheng

Post on 09-Feb-2017


Summarizing Semantic Data
Gong Cheng

National Key Laboratory for Novel Software Technology
Nanjing University, China

Websoft

What is semantic data?
• Entity
• Class
• Property
• Attribute
• Relation

Datasets

Semantic datasets on the Web

What is semantic data summarization? Why?
1. Summarizing entity descriptions (a.k.a. entity summarization)

What is semantic data summarization? Why?
2. Summarizing entity associations

[Figure: example graph of associations between Alice and Bob. Nodes: Alice, Bob, article-A, paper-A, paper-B, paper-C, paper-D, AAAI, IJCAI. Edge labels: firstAuthor, secondAuthor, reviewer, chair, inProcOf, cites, extends]

What is semantic data summarization? Why?
3. Summarizing semantic datasets

Two types of summaries
• Extractive methods
  • summary = a subset of data
  • summarization = ranking and selection
• Abstractive methods (a.k.a. non-extractive methods)
  • summary = a high-level abstraction of data
  • summarization = a more complex process

Outline of this talk
• Summarizing entity descriptions
• Summarizing entity associations
• Summarizing semantic datasets
• Summarizing ontologies (if time permits)

Summarizing entity descriptions
• Extractive methods (summary = a subset of property-value pairs)
  • Metrics for ranking property-value pairs
    • Intrinsic metrics
    • Extrinsic metrics
  • Structures for combining metrics
• Abstractive methods
  • Not known yet

Property      Value
name          Leonardo da Vinci
type          Person
type          Artist
dateOfBirth   1452-04-15
creates       Mona Lisa
creates       Lady with an Ermine
knownFor      Mona Lisa
influenced    Richard Feynman

Intrinsic metrics
1. Frequency
2. Centrality
3. Informativeness
4. Diversity

Intrinsic metrics (1): frequency
• Frequency of property
  • Frequency in the dataset
  • Frequency among entities of the same type
  • Frequency in this entity description
  • Frequency in the ontology (i.e., richness of definition)

[Figure: the example description of Leonardo da Vinci, next to two other entity descriptions: one containing "influenced …", the other containing "type Artist" and "creates …"]
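The property-frequency variants listed above can be sketched with simple counting. This is a minimal illustration, not the method of any cited paper: the toy dataset, the entity "Raphael" and its values, and both function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy dataset: entity -> list of (property, value) pairs.
DATASET = {
    "Leonardo da Vinci": [
        ("type", "Person"), ("type", "Artist"),
        ("creates", "Mona Lisa"), ("creates", "Lady with an Ermine"),
        ("influenced", "Richard Feynman"),
    ],
    "Raphael": [("type", "Artist"), ("creates", "The School of Athens")],
    "Richard Feynman": [("type", "Person"), ("type", "Scientist")],
}

def property_frequency_in_dataset(dataset):
    """Count, for each property, how many entities use it at least once."""
    counts = Counter()
    for pairs in dataset.values():
        counts.update({prop for prop, _ in pairs})
    return counts

def property_frequency_in_description(pairs):
    """Count how many times each property occurs in one entity description."""
    return Counter(prop for prop, _ in pairs)
```

Frequency among entities of the same type could be obtained by first restricting the dataset to entities sharing a type and then calling the first function on that subset.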

Intrinsic metrics (1): frequency
• Frequency of property value
  • Frequency in the dataset (note: entities in text)
  • Frequency in this entity description (note: indirect relations)

[Figure: the example description of Leonardo da Vinci, next to other entity descriptions whose values are "Mona Lisa", "Lady with an Ermine", and text containing "…Mona Lisa…". Caption: Indirect relations may also be counted.]

Intrinsic metrics (1): frequency
• Frequency of property-value pair
  • Frequency among similar entities
  • Frequency in the dataset (why not?)

[Figure: the example description of Leonardo da Vinci, next to the description of a similar entity containing "type Artist" and "influenced Richard Feynman"]

Intrinsic metrics (2): centrality
• Centrality of property value
  • Within the dataset: (weighted) PageRank
  • On the Web: authority of datasets referencing it

[Figure: the example description of Leonardo da Vinci]

Intrinsic metrics (2): centrality
• Centrality of property-value pair
  • PageRank, weighted by inverse Google distance [Cheng et al., ISWC’11]

[Figure: the example description of Leonardo da Vinci, with the property-value pairs "name: Leonardo da Vinci", "type: Person", and "creates: Mona Lisa" drawn as nodes of a graph]

Intrinsic metrics (3): informativeness
• Informativeness of property-value pair
  • Self-information of property-value pair [Cheng et al., ISWC’11]
  • Depth of class

[Figure: the example description of Leonardo da Vinci, next to another entity description containing "type Person" and "type Scientist"; class hierarchy with Person at the top and Artist and Scientist below it]
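Self-information can be illustrated as the negative log of how often a property-value pair occurs. The sketch below assumes the probability of a pair is estimated as the fraction of entity descriptions containing it; that estimate, the function name, and the toy descriptions are assumptions for illustration rather than the exact definition in [Cheng et al., ISWC’11].

```python
import math

def self_information(pair, descriptions):
    """Return -log2 of the fraction of entity descriptions containing `pair`."""
    containing = sum(1 for pairs in descriptions if pair in pairs)
    if containing == 0:
        raise ValueError("pair does not occur in any description")
    return -math.log2(containing / len(descriptions))

# Hypothetical descriptions, each a set of (property, value) pairs.
DESCRIPTIONS = [
    {("type", "Person"), ("type", "Artist"), ("creates", "Mona Lisa")},
    {("type", "Person"), ("type", "Scientist")},
    {("type", "Person"), ("type", "Artist")},
    {("type", "Person")},
]
```

Under this estimate a pair shared by every entity, such as ("type", "Person") here, carries no information, while a rare pair such as ("creates", "Mona Lisa") scores highest.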

Intrinsic metrics (4): diversity
• Diversity of properties
  • To avoid common properties
  • To avoid properties having similar values

[Figure: the example description of Leonardo da Vinci]

Intrinsic metrics (4): diversity
• Diversity of property-value pairs [Cheng et al., JoWS’15, WWW’15]
  • Similarity between text: string-based, word-based
  • Similarity between numbers
  • Semantic similarity: reasoning-based

[Figure: the example description of Leonardo da Vinci; class hierarchy with Person at the top and Artist and Scientist below it; reasoning example: type: Artist ⇒ type: Person]
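Two of the similarity families named above can be sketched with generic measures. The word-level Jaccard coefficient and the relative-difference formula below are stand-ins chosen for illustration; the cited papers define their own measures, and both function names are assumptions.

```python
def word_jaccard(a, b):
    """Word-based similarity: shared words divided by all distinct words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def number_similarity(x, y):
    """Numeric similarity: 1 for equal numbers, falling to 0 as the relative gap grows."""
    if x == y:
        return 1.0
    # Clamp at 0 so that numbers of opposite sign are simply "not similar".
    return max(0.0, 1.0 - abs(x - y) / max(abs(x), abs(y)))
```

A diversity-aware summarizer would penalize selecting two pairs whose values score high under such measures, for example "creates Mona Lisa" and "knownFor Mona Lisa".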

Extrinsic metrics
1. Using external knowledge
2. Context-based

Extrinsic metrics (1): using external knowledge
• Using domain knowledge
  • Certain properties are known to be important.
• Using indicators on the Web
  • Search engine hits
  • Bidirectional links in Wikipedia
• Using user feedback
  • User clicks

[Figure: the example description of Leonardo da Vinci]

Extrinsic metrics (2): context-based
• Entity search results
  • context = query
  • solution: query relevance [Cheng et al., IJSWIS’09]

Extrinsic metrics (2): context-based
• Entities in a document
  • context = contents of the document
  • solution: Class Vector Model [Cheng et al., WWW’15]

[Figure: the example description of Leonardo da Vinci, with property-value pairs annotated "vector = {Painting}" and "vector = {Artist}"]
Document: "… The Starry Night, from MoMA’s collection, reminds us of some work painted by Leonardo da Vinci. ..."
Context entity description: type Painting, giving vector(context) = {Painting}

Extrinsic metrics (2): context-based
• Co-summarization
  • context = other entities
  • solution:
    • difference from other entities [Cheng et al., WWW’15] (for entity linking)
    • similarity with other entities [Cheng et al., JoWS’15] (for entity coreference resolution)

Structures for combining metrics
1. Result combination

[Figure: item rankings "5 1 3 2 4", "5 2 4 1 3", and "5 1 2 4 3"; labels: Ranked by Metric A, Ranked by Metric B, Ranked by Metric C, Summary]

Structures for combining metrics
1. Result combination (cont.)
• Ranked by Metric A
• Ties broken by Metric B
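Ranking by one metric and breaking ties with another amounts to a lexicographic sort. A minimal sketch, assuming higher scores are better and that each metric is given as a dictionary from item to score (the function name and data layout are illustrative):

```python
def rank_with_tie_breaking(items, metric_a, metric_b):
    """Sort by Metric A (descending); items tied on A are ordered by Metric B (descending)."""
    return sorted(items, key=lambda item: (-metric_a[item], -metric_b[item]))
```

For example, with Metric A scores {p1: 2, p2: 3, p3: 2} and Metric B scores {p1: 0.1, p2: 0.0, p3: 0.9}, p2 comes first on Metric A alone, and the tie between p1 and p3 is resolved in favor of p3.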

Structures for combining metrics
2. Arithmetic combination

α*MetricA + β*MetricB
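The linear combination can be sketched as a weighted sum followed by top-k selection. The weights, the candidate scores, and the function names below are assumptions for illustration; in practice the weights would be tuned or learned.

```python
def combined_score(metric_a, metric_b, alpha, beta):
    """Arithmetic combination: alpha * MetricA + beta * MetricB."""
    return alpha * metric_a + beta * metric_b

def top_k(candidates, k, alpha, beta):
    """candidates maps each property-value pair to its (MetricA, MetricB) scores."""
    ranked = sorted(
        candidates,
        key=lambda pair: combined_score(*candidates[pair], alpha, beta),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical scores: (frequency-like metric, informativeness-like metric).
CANDIDATES = {
    ("type", "Person"): (0.9, 0.1),
    ("creates", "Mona Lisa"): (0.4, 0.8),
    ("dateOfBirth", "1452-04-15"): (0.2, 0.3),
}
```

Shifting weight from α to β changes which pair wins: with α = 1, β = 0 the frequent pair is ranked first, while with α = 0, β = 1 the informative pair is.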

Structures for combining metrics
• e.g., combinatorial optimization
  • Quadratic Knapsack Problem [Cheng et al., JoWS’15]
  • Quadratic Multidimensional Knapsack Problem [Cheng et al., WWW’15]

[Figure: annotated objective and constraints. Labels: length constraint; similarity with and difference from other entities; profit matrix with informativeness on the diagonal and inverse similarity off the diagonal; inverse similarity between one entity and the other entity]
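A quadratic knapsack formulation can be illustrated on a tiny instance: each item has an individual profit (the diagonal, e.g. informativeness), each pair of items has a joint profit (off-diagonal, e.g. inverse similarity), and the total weight (e.g. display length) must fit a capacity. The exhaustive solver below is only a sketch for small inputs and is not the algorithm used in the cited papers; the function name and numbers are assumptions.

```python
from itertools import combinations

def solve_qkp(profit, weights, capacity):
    """Exhaustively solve a small quadratic knapsack instance.

    profit[i][i] is the profit of item i alone; profit[i][j] with i < j is
    the extra profit gained when items i and j are both selected.
    """
    n = len(weights)
    best_value, best_subset = 0, ()
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if sum(weights[i] for i in subset) > capacity:
                continue  # violates the length constraint
            value = sum(profit[i][i] for i in subset)
            value += sum(profit[i][j] for i, j in combinations(subset, 2))
            if value > best_value:
                best_value, best_subset = value, subset
    return best_value, best_subset

# Items 0 and 1 are individually best but mutually similar (pair profit 0).
PROFIT = [
    [3, 0, 2],
    [0, 3, 1],
    [0, 0, 2],
]
WEIGHTS = [2, 2, 2]
```

With a capacity of 4, the solver prefers items 0 and 2 over the two individually strongest but redundant items 0 and 1, which is the diversity effect the quadratic term is meant to capture.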

Structures for combining metrics
• e.g., weighted PageRank [Cheng et al., ISWC’11]

[Figure: the example description of Leonardo da Vinci as a graph of property-value pairs (name: Leonardo da Vinci; type: Person; creates: Mona Lisa). Labels: probability of jumping, probability of following edges, inverse Google distance, informativeness]
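A weighted PageRank over property-value pairs can be sketched by power iteration. The sketch assumes that edge-following probabilities are proportional to a relatedness weight (such as inverse Google distance) and that jump probabilities are proportional to informativeness; this mapping is a reading of the figure labels, and the damping factor, function name, and toy weights are assumptions.

```python
def weighted_pagerank(nodes, relatedness, informativeness, damping=0.85, iterations=50):
    """Power iteration with weighted edges and a non-uniform jump distribution.

    relatedness[(u, v)] is the nonnegative weight of edge u -> v;
    informativeness[v] is the nonnegative weight of jumping to v.
    """
    total_info = sum(informativeness[v] for v in nodes)
    jump = {v: informativeness[v] / total_info for v in nodes}
    out_weight = {
        u: sum(relatedness.get((u, v), 0.0) for v in nodes if v != u) for u in nodes
    }
    score = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        # Nodes without outgoing weight hand their score to the jump distribution.
        dangling = sum(score[u] for u in nodes if out_weight[u] == 0)
        new_score = {}
        for v in nodes:
            follow = sum(
                score[u] * relatedness.get((u, v), 0.0) / out_weight[u]
                for u in nodes
                if u != v and out_weight[u] > 0
            )
            new_score[v] = (1 - damping) * jump[v] + damping * (follow + dangling * jump[v])
        score = new_score
    return score
```

The scores always sum to 1, so they can be read as the long-run share of visits each property-value pair receives; the top-scoring pairs would form the summary.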

Structures for combining metrics
3. Machine learning
• Decision trees
• Linear regression

Structures for combining metrics
4. Complex combinations
• Result combination + arithmetic combination
• Machine learning + arithmetic combination

Outline of this talk
• Summarizing entity descriptions
• Summarizing entity associations
• Summarizing semantic datasets

Summarizing entity associations
• Extractive methods
  • Finding and ranking associations between two entities (summary = a subset of paths)
    • Path finding and filtering
    • Intrinsic and extrinsic metrics for ranking paths
    • Structures for combining metrics
  • Finding and ranking associations between multiple entities (summary = a subset of subgraphs)
• Abstractive methods
  • Ranking association patterns
  • Hierarchically organizing association patterns

[Figure: the example graph of associations between Alice and Bob]

Finding associations between two entities
• Path finding
  • Dijkstra or A*
  • Bidirectional breadth-first search (bi-BFS)
  • Schema-based performance optimization

[Figure: the example graph of associations between Alice and Bob, with its schema. Classes: Paper, Person, Conference. Relations: reviewer, chair; inProcOf; firstAuthor, secondAuthor; cites, extends]

Search cost for maximum degree Δ and path length d: O(Δ^d) for unidirectional search vs. O(Δ^(d/2)) for bidirectional search
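Bidirectional breadth-first search grows one frontier from each endpoint and stops when they meet, so each side only explores to about half the path length. The sketch below treats the graph as undirected and returns one shortest path; the adjacency list is hypothetical (the edges of the slide's figure are not recoverable from the transcript), and the function names are assumptions.

```python
def _expand(graph, frontier, parents, other_parents):
    """Expand one BFS level; return (next frontier, meeting node or None)."""
    next_frontier = []
    for node in frontier:
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor in other_parents:
                return next_frontier, neighbor
            next_frontier.append(neighbor)
    return next_frontier, None

def _join(meet, parents_fwd, parents_bwd):
    """Stitch the two half-paths together at the meeting node."""
    path, node = [], meet
    while node is not None:
        path.append(node)
        node = parents_fwd[node]
    path.reverse()
    node = parents_bwd[meet]
    while node is not None:
        path.append(node)
        node = parents_bwd[node]
    return path

def bidirectional_bfs(graph, source, target):
    """Return one shortest path between source and target, or None if disconnected."""
    if source == target:
        return [source]
    parents_fwd, parents_bwd = {source: None}, {target: None}
    frontier_fwd, frontier_bwd = [source], [target]
    while frontier_fwd and frontier_bwd:
        # Always expand the smaller frontier to keep both searches balanced.
        if len(frontier_fwd) <= len(frontier_bwd):
            frontier_fwd, meet = _expand(graph, frontier_fwd, parents_fwd, parents_bwd)
        else:
            frontier_bwd, meet = _expand(graph, frontier_bwd, parents_bwd, parents_fwd)
        if meet is not None:
            return _join(meet, parents_fwd, parents_bwd)
    return None

# Hypothetical undirected adjacency lists loosely inspired by the example graph.
GRAPH = {
    "Alice": ["paper-A", "paper-B"],
    "paper-A": ["Alice", "AAAI"],
    "AAAI": ["paper-A", "Bob"],
    "paper-B": ["Alice", "paper-C"],
    "paper-C": ["paper-B", "paper-D"],
    "paper-D": ["paper-C", "Bob"],
    "Bob": ["AAAI", "paper-D"],
    "IJCAI": [],
}
```

Finding all associations up to a length bound, rather than one shortest path, would require continuing the expansion after the first meeting and enumerating every join.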

Finding associations between two entities
• Path filtering
  • By length
  • By entities, classes, relations
  • By keywords

[Figure: the example graph of associations between Alice and Bob]
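The three filters listed above can be sketched as one predicate over a path. Here a path is assumed to be a list of (subject, relation, object) triples, and keyword matching is assumed to be a case-insensitive substring test on labels; the function name, parameters, and the sample path are illustrative.

```python
def keep_path(path, max_length=None, excluded_relations=(), keyword=None):
    """Return True if the path passes the length, relation, and keyword filters."""
    if max_length is not None and len(path) > max_length:
        return False
    if any(relation in excluded_relations for _, relation, _ in path):
        return False
    if keyword is not None:
        labels = [label for triple in path for label in triple]
        if not any(keyword.lower() in label.lower() for label in labels):
            return False
    return True

# Hypothetical path; edge directions are assumptions.
PATH = [
    ("paper-A", "firstAuthor", "Alice"),
    ("paper-A", "inProcOf", "AAAI"),
    ("Bob", "reviewer", "AAAI"),
]
```

Filtering by entities or classes would follow the same shape as the relation filter, checking subjects and objects (or their classes) against an exclusion or inclusion set.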

Ranking associations between two entities
• Intrinsic metrics
  • Frequency
  • Centrality
  • Informativeness
  • Diversity
  • Length
  • Conformity
• Extrinsic metrics
  • Using external knowledge
  • Context-based
• Structures for combining metrics

Intrinsic metrics: frequency, centrality, diversity, length
• Property frequency
• Degree centrality
• Diverse relations
• Length

[Figure: the example graph of associations between Alice and Bob]

Intrinsic metrics: informativeness
• Informativeness
  • Data-based informativeness: inverse relation frequency
  • Schema-based informativeness: depth of class/relation

[Figure: the example graph of associations between Alice and Bob]

Intrinsic metrics: conformity
• Conformity to schema

[Figure: the example graph of associations between Alice and Bob, with its schema. Classes: Paper, Person, Conference. Relations: reviewer, chair; inProcOf; firstAuthor, secondAuthor; cites, extends]

Extrinsic metrics
• Using external knowledge
  • Explicit: user-defined weights
  • Implicit: user’s Web browsing history
• Context-based
  • Query relevance

[Figure: the example graph of associations between Alice and Bob]

Finding and ranking associations between multiple entities
• association = a size-constrained connected subgraph (size = number of other entities)

[Figure: 3 associations via 2 other entities]

Finding and ranking associations between multiple entities
• association = a size-constrained connected subgraph (size = diameter) [Cheng et al., ISWC’16]

[Figure: 3 associations having a diameter of 3]

Finding and ranking associations between multiple entities
• Subgraph finding
  • n-directional breadth-first search
  • Distance-based performance optimization [Cheng et al., ISWC’16]

Finding and ranking associations between multiple entities
• Subgraph ranking (based on entity ranking)
  • PageRank
  • Query relevance
  • Number of short paths
  • Random walk with restart

Finding and ranking associations between multiple entities
• association = a Steiner tree (size-unconstrained, weight-minimized)

Abstractive methods
• Association pattern [Cheng et al., ISWC’14]

[Figure: two associations, (secondAuthor, paper-A, inProcOf, conf-A, reviewer) and (firstAuthor, paper-B, inProcOf, conf-B, chair), abstracted into the pattern (author, Paper, inProcOf, Conference, role)]
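The abstraction from associations to a pattern can be sketched as replacing each entity by its class and each relation by a more general relation, then counting how many associations share the result. The two lookup tables below encode only the example from the slide; the flat tuple encoding of an association and the function name are assumptions.

```python
from collections import Counter

# Lookup tables for the slide's example (entity -> class, relation -> super-relation).
ENTITY_CLASS = {
    "paper-A": "Paper", "paper-B": "Paper",
    "conf-A": "Conference", "conf-B": "Conference",
}
SUPER_RELATION = {
    "firstAuthor": "author", "secondAuthor": "author",
    "reviewer": "role", "chair": "role",
}

def to_pattern(association):
    """Generalize every element that has a known class or super-relation."""
    return tuple(
        ENTITY_CLASS.get(element, SUPER_RELATION.get(element, element))
        for element in association
    )

ASSOCIATIONS = [
    ("secondAuthor", "paper-A", "inProcOf", "conf-A", "reviewer"),
    ("firstAuthor", "paper-B", "inProcOf", "conf-B", "chair"),
]
PATTERN_FREQUENCY = Counter(to_pattern(a) for a in ASSOCIATIONS)
```

Both associations collapse into the same pattern, so its frequency is 2; ranking patterns by such counts is one way to decide which abstractions to show first.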

Abstractive methods
• Association pattern [Cheng et al., ISWC’16]

[Figure: associations and the patterns abstracted from them]

Ranking association patterns
• Metrics
  • Frequency
  • Informativeness
  • Diversity
• Structures for combining metrics

[Figure: the pattern (author, Paper, inProcOf, Conference, role)]

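Metrics: frequency
• frequency = occurrences of canonical code [Cheng et al., ISWC’16]

[Figure: testing whether two patterns are isomorphic by comparing their canonical codes, e.g. e_q1 r1 C1 r2 C2 r3 e_q2 $ r4 e_q3 $ $ $ $ (when T = e)]

Metrics: frequency
• frequency = occurrences of canonical code [Cheng et al., ISWC’16]

[Figure: a pattern whose classes cannot be ordered directly]
Solution: using query entities as proxies for classes to be ordered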

Hierarchically organizing association patterns
• subClassOf/subPropertyOf → subPatternOf [Zhang et al., JIST’13]

[Figure: the pattern (author, Paper, inProcOf, Conference, role) with two sub-patterns: (author, Demo, inProcOf, Conference, reviewer) and (author, Poster, inProcOf, Conference, chair)]

Outline of this talk
• Summarizing entity descriptions
• Summarizing entity associations
• Summarizing semantic datasets

Summarizing semantic datasets
• Extractive methods (summary = a subset of triples)
  • Centrality
• Abstractive methods
  1. Inferred schema
  2. Flat partitioning
  3. Hierarchical grouping

Extractive methods
• Triple ranking (based on entity ranking)
  • Centrality: degree, PageRank

[Figure: the example graph of associations between Alice and Bob]

Abstractive methods (1): inferred schema
• summary = a graph-structured (sub-)schema inferred from data (grouping entities by classes)

[Figure: the example graph of associations between Alice and Bob, with its inferred schema. Classes: Paper, Person, Conference. Relations: reviewer, chair; inProcOf; firstAuthor, secondAuthor; cites, extends]
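Inferring a schema by grouping entities by class can be sketched as lifting every triple to the classes of its subject and object. The triples, their edge directions, and the class assignments below are hypothetical, as is the function name; entities with several classes would need one schema edge per class combination.

```python
def infer_schema(triples, entity_class):
    """Lift each (subject, relation, object) triple to (class, relation, class)."""
    return {
        (entity_class[subject], relation, entity_class[obj])
        for subject, relation, obj in triples
    }

# Hypothetical data; directions of the relations are assumptions.
ENTITY_CLASS = {
    "Alice": "Person", "Bob": "Person",
    "paper-A": "Paper", "paper-B": "Paper",
    "AAAI": "Conference",
}
TRIPLES = [
    ("paper-A", "firstAuthor", "Alice"),
    ("paper-B", "firstAuthor", "Bob"),
    ("paper-A", "inProcOf", "AAAI"),
    ("Bob", "reviewer", "AAAI"),
    ("paper-B", "cites", "paper-A"),
]
```

Five data-level triples collapse into four schema-level edges here, and the reduction grows with the dataset, which is what makes the inferred schema usable as a summary.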

Abstractive methods (1): inferred schema
• Metrics for ranking classes and properties
  • Frequency
  • Centrality

[Figure: the example graph of associations between Alice and Bob, with its inferred schema]

Abstractive methods (2): flat partitioning
• summary = entity partitions connected by relations
  • partitioning by shared classes (= inferred schema)
  • partitioning by shared attributes
  • partitioning by shared paths (a.k.a. bisimulation)

[Figure: the example graph of associations between Alice and Bob, with its inferred schema]

Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16]
• summary = a hierarchical grouping of entities
  • identified by property-value pairs
  • connected by relations

[Figure: a hierarchical grouping of entities; relations connecting sibling groups]

• Metrics for choosing groups (i.e., property-value pairs)
  • Coverage of data → large subgroups
  • Height of hierarchy → moderate-sized subgroups
  • Cohesion within groups → informative property-value pairs
  • Overlap between groups → controllable overlap
  • Homogeneity of groups → different values of the same property

Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16]
• Combining metrics by combinatorial optimization (formulated as a multidimensional knapsack problem)
  • Objective: maximizing moderateness of each subgroup; maximizing cohesion within each subgroup
  • Constraint: disallowing large overlap between subgroups
  • Constraint: selecting ≤ k subgroups
  • Constraint (optional): disallowing different properties

Abstractive methods (3): hierarchical grouping [Cheng et al., IJCAI’16]

Concluding remarks
• Research
  • More application scenarios are to be identified.
  • New applications may promote new metrics.
  • More benchmarks are needed for evaluation.
• Practice
  • Handy tools for semantic data summarization are missing.

The 2016 ENtity Summarization Evaluation Campaign (ENSEC 2016)
http://km.aifb.kit.edu/ws/sumpre2016/challenge.html

Papers on summarizing entity descriptions
• Gong Cheng, Danyun Xu, Yuzhong Qu. Summarizing Entity Descriptions for Effective and Efficient Human-centered Entity Linking. (WWW’15)
• Gong Cheng, Danyun Xu, Yuzhong Qu. C3D+P: A Summarization Method for Interactive Entity Resolution. (JoWS’15)
• Gong Cheng, Thanh Tran, Yuzhong Qu. RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization. (ISWC’11)
• Gong Cheng, Yuzhong Qu. Searching Linked Objects with Falcons: Approach, Implementation and Evaluation. (IJSWIS’09)

Papers on summarizing entity associations
• Gong Cheng, Daxin Liu, Yuzhong Qu. Efficient Algorithms for Association Finding and Frequent Association Pattern Mining. (ISWC’16)
• Gong Cheng, Yanan Zhang, Yuzhong Qu. Explass: Exploring Associations between Entities via Top-K Ontological Patterns and Facets. (ISWC’14)
• Yanan Zhang, Gong Cheng, Yuzhong Qu. Towards Exploratory Relationship Search: A Clustering-based Approach. (JIST’13)

Papers on summarizing semantic datasets
• Gong Cheng, Cheng Jin, Yuzhong Qu. HIEDS: A Generic and Efficient Approach to Hierarchical Dataset Summarization. (IJCAI’16)

Ontology
• Terms
  • Publication
  • Paper
  • Conference
  • title
  • inProc
• Term descriptions
  • SubClassOf(Paper, Publication)
  • SubClassOf(Paper, DataExactCardinality(1, title))
  • ObjectPropertyDomain(inProc, Paper)
  • ObjectPropertyRange(inProc, Conference)

Summarizing ontologies: an application

Summarizing ontologies
• Extractive methods
  1. Ranking terms (summary = a subset of terms)
  2. Ranking term descriptions (summary = a subset of term descriptions)
  3. Ranking subgraphs (summary = a subgraph)
• Abstractive methods
  • Not known yet

Extractive methods (1): ranking terms
• Intrinsic metrics
  1. Frequency
  2. Centrality
  3. Diversity
  4. Simplicity
• Extrinsic metrics
  1. Using external knowledge
  2. Context-based

Intrinsic metrics (1): frequency
• Schema-based frequency
• Data-based frequency

Intrinsic metrics (2): centrality
• Middleness in the hierarchy
• Degree
• Betweenness
• PageRank

[Figure: term graph over Paper, Publication, title, inProc, and Conference; class hierarchy in three levels: Publication; Paper, Book; Article, Poster]

Intrinsic metrics (3): diversity
• Coverage of hierarchy

[Figure: class hierarchy in three levels: Publication; Paper, Book; Article, Poster]

Intrinsic metrics (4): simplicity
• Number of words in the name of a term

Paper vs. PaperPublishedAtCCKS2016

Extrinsic metrics
• Using external knowledge
  • Search engine hits
  • Personalization (e.g., spreading activation)
• Context-based
  • Query relevance

[Figure: term graph over Paper, Publication, title, inProc, and Conference]

Extractive methods (2): ranking term descriptions
• Graph representation of term descriptions
  1. Description graph
  2. Term-description graph
• Ranking term descriptions
  • Intrinsic metrics
  • Extrinsic metrics

Graph representation (1): description graph [Zhang et al., WWW’07]

SubClassOf(Paper, Publication)
SubClassOf(Paper, DataExactCardinality(1, title))
ObjectPropertyDomain(inProc, Paper)
ObjectPropertyRange(inProc, Conference)

[Figure: description graph whose nodes are the four term descriptions above]

Graph representation (2): term-description graph [Zhang et al., JCST’09; Cheng et al., JIST’11]

SubClassOf(Paper, Publication)
SubClassOf(Paper, DataExactCardinality(1, title))
ObjectPropertyDomain(inProc, Paper)
ObjectPropertyRange(inProc, Conference)

[Figure: term-description graph connecting the four term descriptions above with the terms Paper, Publication, title, inProc, and Conference]

Ranking term descriptions
• Intrinsic metrics
  • Frequency
  • Centrality
  • Diversity
  • Cohesion/coherence
• Extrinsic metrics
  • Query relevance

[Figure: the four term descriptions of the example ontology]

Papers on summarizing ontologies
• Weiyi Ge, Gong Cheng, Huiying Li, Yuzhong Qu. Incorporating Compactness to Generate Term-association View Snippets for Ontology Search. (IP&M’13)
• Gong Cheng, Feng Ji, Shengmei Luo, Weiyi Ge, Yuzhong Qu. BipRank: Ranking and Summarizing RDF Vocabulary Descriptions. (JIST’11)
• Xiang Zhang, Gong Cheng, Weiyi Ge, Yuzhong Qu. Summarizing Vocabularies in the Global Semantic Web. (JCST’09)
• Xiang Zhang, Gong Cheng, Yuzhong Qu. Ontology Summarization Based on RDF Sentence Graph. (WWW’07)