Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach
André Freitas and Edward Curry, Insight Centre for Data Analytics
International Conference on Intelligent User Interfaces, Haifa, 2014


DESCRIPTION

The demand to access large amounts of heterogeneous structured data is emerging as a trend for many users and applications. However, the effort involved in querying heterogeneous and distributed third-party databases can create major barriers for data consumers. At the core of this problem is the semantic gap between the way users express their information needs and the representation of the data. This work aims to provide a natural language interface and an associated semantic index to support an increased level of vocabulary independence for queries over Linked Data/Semantic Web datasets, using a distributional-compositional semantics approach. Distributional semantics focuses on the automatic construction of a semantic model based on the statistical distribution of co-occurring words in large-scale texts. The proposed query model targets the following features: (i) a principled semantic approximation approach with low adaptation effort (independent of manually created resources such as ontologies, thesauri or dictionaries), (ii) comprehensive semantic matching supported by the inclusion of large volumes of distributional (unstructured) commonsense knowledge in the semantic approximation process and (iii) expressive natural language queries. The approach was evaluated using natural language queries on an open domain dataset and achieved avg. recall=0.81, mean avg. precision=0.62 and mean reciprocal rank=0.49.

TRANSCRIPT

Page 1: Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach

Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach

André Freitas and Edward Curry

Insight Centre for Data Analytics

International Conference on Intelligent User Interfaces

Haifa, 2014

Page 2

Talking to your (Big) Data

Page 3

Motivation

Page 4

Shift in the Database Landscape

Heterogeneous, complex and large-scale databases.

Very-large and dynamic “schemas”.

circa 2000: 10s-100s attributes

circa 2014: 1,000s-1,000,000s attributes

Page 5

Databases for a Complex World

How do you query data in this scenario?

Page 6

Vocabulary Problem for Databases

Query: Who is the daughter of Bill Clinton married to?

Semantic Gap

Semantic approximation

Possible representations = Commonsense Knowledge

Page 7

Semantics for a Complex World

Formal World Real World

Distributional Semantics

Query Approach

Page 8

Does it work?

Page 9

Addressing the Vocabulary Problem for Databases (with Distributional Semantics)

Gaelic: direction

Page 10

Solution (Video)

Page 11

More Complex Queries (Video)

Page 12

Treo Answers Jeopardy Queries (Video)

http://bit.ly/1hWcch9

Page 13

Evaluation

102 natural language queries (Test Collection: QALD 2011).

Avg. query execution time: 1.52 s (simple queries) – 8.53 s (all queries).

Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
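
The description reports average recall, mean average precision and mean reciprocal rank for this evaluation. As a reminder of how such measures are typically computed over ranked answer lists, here is a minimal sketch. The two runs at the bottom are invented toy data, not QALD queries, and the function names are illustrative assumptions rather than the authors' evaluation code. Ranked lists are assumed to contain no duplicates.

```python
def recall(ranked, relevant):
    """Fraction of the relevant answers that appear anywhere in the ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant answer occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant answer, or 0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy (ranked answers, gold answers) pairs, invented for illustration only.
runs = [
    (["a", "x", "b"], {"a", "b"}),
    (["y", "c"], {"c", "d"}),
]

avg_recall = mean(recall(r, gold) for r, gold in runs)
mean_ap = mean(average_precision(r, gold) for r, gold in runs)
mrr = mean(reciprocal_rank(r, gold) for r, gold in runs)
print(avg_recall, mean_ap, mrr)
```

Each measure is computed per query and then averaged over all queries, which is why a single poorly answered query lowers all three figures.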

Page 14

Comparative Evaluation

Page 15

Query Approach

Page 16

Distributional Semantics

“Words occurring in similar (linguistic) contexts are semantically related.”

If we can equate meaning with context, we can simply record the contexts in which a word occurs in a collection of texts (a corpus).

This can then be used as a surrogate of its semantic representation.
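
To make the idea of "recording the contexts in which a word occurs" concrete, the following is a minimal sketch that builds co-occurrence vectors from a tiny corpus. The corpus, the window size and the function name `context_vectors` are assumptions made for illustration; a real model would be built from a large text collection and would usually weight the raw counts.

```python
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


# Toy corpus, invented for illustration only.
corpus = [
    "her daughter married a lawyer",
    "their child married a doctor",
    "the school admitted a student",
]

vectors = context_vectors(corpus)
print(vectors["daughter"])
print(vectors["child"])
```

In this toy corpus "daughter" and "child" share the contexts "married" and "a", which is the kind of overlap a distributional model uses as evidence that two words are related.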

Page 17

Distributional Semantic Model

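[Figure: the words child, husband and spouse shown as vectors in a space whose dimensions are the contexts c1, c2, ..., cn. Each coordinate (for example 0.7 or 0.5) is a function of the number of times that the word occurs in that context.]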

Commonsense is here

Page 18

Semantic Relatedness

[Figure: the angle θ between the vectors for child, husband and spouse in the context space c1, c2, ..., cn.]

Works as a semantic ranking function
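
One common way to turn the angle θ between two word vectors into a relatedness score is the cosine of that angle. The sketch below assumes cosine similarity over sparse context vectors; the slide does not name the exact measure, and the vectors and weights shown are hypothetical values chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {context: weight}."""
    dot = sum(weight * v.get(ctx, 0.0) for ctx, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_relatedness(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least related."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical context weights, not taken from any real distributional model.
daughter = {"c1": 0.7, "c2": 0.5}
candidates = {
    "child": {"c1": 0.6, "c2": 0.6},
    "religion": {"c2": 0.1, "c3": 0.9},
    "almaMater": {"c3": 0.8, "c4": 0.4},
}

ranking = rank_by_relatedness(daughter, candidates)
for name, score in ranking:
    print(name, round(score, 3))
```

Because the score depends only on the angle, vectors that point in a similar direction rank high regardless of how frequent the underlying words are.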

Page 19

Approach Overview

Query → Query Analysis → Query Features → Query Planner → Query Plan

Core semantic approximation & composition operations

Ƭ-Space

Database

Distributional semantics (commonsense knowledge), built from large-scale unstructured data

Page 20

Approach Overview

Query → Query Analysis → Query Features → Query Planner → Query Plan

Core semantic approximation & composition operations

Ƭ-Space

RDF Data

Explicit Semantic Analysis (ESA) over Wikipedia (commonsense knowledge)

Page 21

Ƭ-Space

[Figure labels: e, p, r]

Page 22

Core Operations

Query

Page 23

Core Operations

Search & Composition Operations

Query

Page 24

Search and Composition Operations

Instance search
- Proper nouns
- String similarity + node cardinality

Class (unary predicate) search
- Nouns, adjectives and adverbs
- String similarity + Distributional semantic relatedness

Property (binary predicate) search
- Nouns, adjectives, verbs and adverbs
- Distributional semantic relatedness

Navigation

Extensional expansion
- Expands the instances associated with a class.

Operator application
- Aggregations, conditionals, ordering, position

Disjunction & Conjunction

Disambiguation dialog (instance, predicate)
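
Instance search is listed above as combining string similarity with node cardinality. The slides do not say how the two signals are combined, so the sketch below makes an explicit assumption: candidates are ordered by string similarity first, and node cardinality (how many triples touch the node) breaks ties. The candidate URIs other than :Bill_Clinton, the labels and the cardinalities are hypothetical.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] between two labels, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_search(term, instances, top_k=3):
    """Rank (uri, label, cardinality) candidates for a proper-noun term.

    Assumption: similarity is the primary key and cardinality the tie-breaker.
    """
    scored = []
    for uri, label, cardinality in instances:
        scored.append((string_similarity(term, label), cardinality, uri))
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [uri for _, _, uri in scored[:top_k]]


# Hypothetical candidates: (URI, label, number of triples touching the node).
instances = [
    (":Bill_Clinton_(example)", "Bill Clinton", 12),
    (":Hillary_Clinton", "Hillary Clinton", 700),
    (":Bill_Clinton", "Bill Clinton", 900),
]

results = instance_search("Bill Clinton", instances)
print(results)
```

Using cardinality as a tie-breaker favours the better-connected node when several instances share the same label, which is one plausible reading of why the slide pairs the two signals.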

Page 25

Core Principles

Minimize the impact of Ambiguity, Vagueness, Synonymy.

Address the simplest matchings first (heuristics).

Semantic Relatedness as a primitive operation.

Distributional semantics as commonsense knowledge.

Page 26

Question Analysis

Transform natural language queries into triple patterns

“Who is the daughter of Bill Clinton married to?”

Query Features:

Bill Clinton (INSTANCE), daughter (PREDICATE), married to (PREDICATE)

PODS

Page 27

Query Plan

Map query features into a query plan.

A query plan contains a sequence of core operations.

Query Features: (INSTANCE) (PREDICATE) (PREDICATE)

Query Plan

(1) INSTANCE SEARCH (Bill Clinton)

(2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)

(3) e1 <- NAVIGATE (Bill Clinton, p1)

(4) p2 <- SEARCH PREDICATE (e1, married to)

(5) e2 <- NAVIGATE (e1, p2)
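
The five-step plan above can be sketched as a small loop that alternates predicate search and navigation from a pivot entity. This is an illustrative sketch under stated assumptions, not the authors' implementation: the graph is a toy dictionary holding only the triples shown in the slides, `RELATEDNESS` is a hypothetical lookup table standing in for a distributional model (the score for "married to" is invented), and the function names are made up for the example.

```python
# Toy graph: subject -> predicate -> list of objects.
GRAPH = {
    ":Bill_Clinton": {
        ":child": [":Chelsea_Clinton"],
        ":religion": [":Baptists"],
        ":almaMater": [":Yale_Law_School"],
    },
    ":Chelsea_Clinton": {":spouse": [":Mark_Mezvinsky"]},
}

# Hypothetical relatedness scores standing in for a distributional model.
RELATEDNESS = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":religion"): 0.004,
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.06,  # invented value
}


def instance_search(label):
    """Resolve a proper noun to a node; here by naive exact label matching."""
    key = ":" + label.replace(" ", "_")
    return key if key in GRAPH else None


def search_predicate(pivot, term):
    """Pick the predicate of `pivot` that is most related to the query term."""
    predicates = GRAPH.get(pivot, {})
    if not predicates:
        return None
    return max(predicates, key=lambda p: RELATEDNESS.get((term, p), 0.0))


def navigate(pivot, predicate):
    """Follow `predicate` from `pivot` and return the entities reached."""
    return GRAPH.get(pivot, {}).get(predicate, [])


def run_plan(instance_label, predicate_terms):
    """Steps (1)-(5): instance search, then search-predicate/navigate per term."""
    start = instance_search(instance_label)
    if start is None:
        return []
    pivots = [start]
    for term in predicate_terms:
        next_pivots = []
        for pivot in pivots:
            predicate = search_predicate(pivot, term)
            if predicate is not None:
                next_pivots.extend(navigate(pivot, predicate))
        pivots = next_pivots
    return pivots


answer = run_plan("Bill Clinton", ["daughter", "married to"])
print(answer)
```

Each navigation step replaces the pivot with the entities it reaches, so the second predicate search only considers predicates attached to the result of the first.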

Page 28

Instance Search

Query: Bill Clinton daughter married to

Linked Data: :Bill_Clinton

Instance search: Bill Clinton → :Bill_Clinton

Page 29

Predicate Search

Query: Bill Clinton daughter married to

Linked Data:

:Bill_Clinton (PIVOT ENTITY)

(ASSOCIATED TRIPLES)
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...

Page 30

Predicate Search

Query: Bill Clinton daughter married to

Linked Data:

:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...

Which properties are semantically related to ‘daughter’?

sem_rel(daughter,child)=0.054

sem_rel(daughter,religion)=0.004

sem_rel(daughter,alma mater)=0.001

Page 31

Navigate

Query: Bill Clinton daughter married to

Linked Data:

:Bill_Clinton :child :Chelsea_Clinton

Page 32

Navigate

Query: Bill Clinton daughter married to

Linked Data:

:Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)

Page 33

Predicate Search

Query: Bill Clinton daughter married to

Linked Data:

:Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
:Chelsea_Clinton :spouse :Mark_Mezvinsky

Page 34

Results

Page 35

Conclusions

The compositional-distributional model supports a schema-agnostic natural language query mechanism over a large-schema (open domain) database.

Comprehensive and accurate semantic matching
- Avg. recall=0.81, mean avg. precision=0.62, mean reciprocal rank=0.49

Medium-high expressivity
- 80% of queries answered

Interactive query execution time
- Avg. 1.52 s (simple queries) – 8.53 s (all queries) / query

Better recall and query coverage compared to baselines with equivalent precision

Low adaptation effort for new datasets