unsaid: uncertainty and structure in the access to ...2014/04/02 · unsaid: uncertainty and...
TRANSCRIPT
2 April 2014, IC2 Lunch Seminar
UnSAID:Uncertainty and Structurein the Accessto Intensional Data
Pierre Senellart
2 / 24 IC2 Pierre Senellart
Uncertain data is everywhere
Numerous sources of uncertain data:
Measurement errors
Data integration from contradicting sources
Imprecise mappings between heterogeneous schemas
Imprecise automatic processes (information extraction, naturallanguage processing, etc.)
Imperfect human judgment
Lies, opinions, rumors
3 / 24 IC2 Pierre Senellart
Structured data is everywhere
Data is structured, not flat:Variety of representation formats of data in the wild:
relational tablestrees, semi-structured documentsgraphs, e.g., social networks or semantic graphsdata streamscomplex views aggregating individual information
Heterogeneous schemas
Additional structural constraints: keys, inclusion dependencies
4 / 24 IC2 Pierre Senellart
Intensional data is everywhereLots of data sources can be seen as intensional: accessing all the datain the source (in extension) is impossible or very costly, but it ispossible to access the data through views, with some accessconstraints, associated with some access cost.
Indexes over regular data sources
Deep Web sources: Web forms, Web services
The Web or social networks as partial graphs that can beexpanded by crawling
Outcome of complex automated processes: information extraction,natural language analysis, machine learning, ontology matching
Crowd data: (very) partial views of the world
Logical consequences of facts, costly to compute
5 / 24 IC2 Pierre Senellart
Introducing UnSAID
Uncertainty and Structure in the Access to Intensional Data
Jointly deal with Uncertainty, Structure, and the fact that accessto data is limited and has a cost, to solve a user’s knowledge need
Lazy evaluation whenever possible
Evolving probabilistic, structured view of the current knowledgeof the world
Solve at each step the problem: What is the next best access to dogiven my current knowledge of the world and the knowledge need
Knowledge acquisition plan (recursive, dynamic, adaptive) thatminimizes access cost, and provides probabilistic guarantees
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
formulation
answer & explanation
optimization
modeling
priors
intensional access
knowledge update
Knowledgeneed
Currentknowledgeof the world
Knowledgeacquisition plan
Uncertainaccess result
Structuredsource profiles
7 / 24 IC2 Pierre Senellart
What this talk is about
General overview of my current (and recent) research, throughone-slide presentation of individual works
Hopefully, emerging consistent themes
Connections with the UnSAID problem
8 / 24 IC2 Pierre Senellart
Plan
Introduction
Instances of UnSAID
Uncertainty and Structure
UnSAID Applications
Conclusion
9 / 24 IC2 Pierre Senellart
Adaptive focused crawling(Gouriten, Maniu, and Senellart 2014)
with
3
5 0
4
0
0
3
5
3
4
2
2
3
3
5 0
4
0
0
3
5
3
4
2
2
3
2 3
0
00
10
0
1
0
1
3
00
0
0
Problem: Efficiently crawl nodes in agraph such that total score is high
Challenge: The score of a node isunknown till it is crawled
Methodology: Use various predictorsof node scores, and adaptively selectthe best one so far with multi-armedbandits
9 / 24 IC2 Pierre Senellart
Adaptive focused crawling(Gouriten, Maniu, and Senellart 2014)
with
3
5 0
4
0
0
3
5
3
4
2
2
3
3
5 0
4
0
0
3
5
3
4
2
2
3
2 3
0
00
10
0
1
0
1
3
00
0
0
Problem: Efficiently crawl nodes in agraph such that total score is high
Challenge: The score of a node isunknown till it is crawled
Methodology: Use various predictorsof node scores, and adaptively selectthe best one so far with multi-armedbandits
9 / 24 IC2 Pierre Senellart
Adaptive focused crawling(Gouriten, Maniu, and Senellart 2014)
with
3
5 0
4
0
0
3
5
3
4
2
2
3
3
5 0
4
0
0
3
5
3
4
2
2
3
2 3
0
00
10
0
1
0
1
3
00
0
0
Problem: Efficiently crawl nodes in agraph such that total score is high
Challenge: The score of a node isunknown till it is crawled
Methodology: Use various predictorsof node scores, and adaptively selectthe best one so far with multi-armedbandits
10 / 24 IC2 Pierre Senellart
Adaptive Web application crawling
with
entrypoint p4
2039
p1
239
p2
754
p3
3227
p5
2600
l 2
l3
l 1
l1 l4
Problem: Optimize the amount ofdistinct content retrieved from a Website w.r.t. the number of HTTPrequests
Challenge: No way to know a prioriwhere the content lies on the Web site
Methodology: Sample a small part ofthe Web site and discover optimalcrawling patterns from it
11 / 24 IC2 Pierre Senellart
Optimizing crowd queries under order
with
Problem: Given a query, what is thenext best question to ask the crowdwhen crowd answers are constrainedby a partial order
Challenge: Order constraints makequestions not independent of eachother
Methodology: Construct a polytopeof admissible regions and uniformlysample from it to determine theimpact of a data item
12 / 24 IC2 Pierre Senellart
Online influence maximization
with
Action Log
User | Action | Date
1 1 2007-10-102 1 2007-10-123 1 2007-10-142 2 2007-11-104 3 2007-11-12
. . .
1
42
3
0.5
0.1 0.9
0.50.2
. . .
Real World Simulator
1
42
3
. . .
Social Graph
1
42
3
. . .
Weighted Uncertain Graph
One Trial
Sampling
1
2
Activation SequencesUpdate
3
Solution Framework Problem: Run influence campaigns insocial networks, optimizing theamount of influenced nodes
Challenge: Influence probabilities areunknown
Methodology: Build a model ofinfluence probabilities and focus oninfluent nodes, with anexploration/exploitation trade-off
13 / 24 IC2 Pierre Senellart
Query answering under uncertain rules
with
Pope(X )) BuriedIn(X ; Rome) (98%)
LocatedIn(X ; Lombardy))
BelongsTo(X ; AustrianEmpire) (45%)
Pope(PiusXI)
BornIn(PiusXI; Desio)
LocatedIn(Desio; Lombardy)
9X ; BornIn(X ;Y ) ^
BelongsTo(Y ; AustrianEmpire) ^
BuriedIn(X ; Rome)?
Problem: Determine efficiently theprobability of a query being true,given some data and uncertain rulesover this data
Challenge: Produced facts may becorrelated, the same facts can begenerated in different ways,probability computation is hard ingeneral. . .
Methodology: Find restrictions on therules (guarded?) and the data(bounded tree-width?) that make theproblem tractable
14 / 24 IC2 Pierre Senellart
Plan
Introduction
Instances of UnSAID
Uncertainty and Structure
UnSAID Applications
Conclusion
15 / 24 IC2 Pierre Senellart
Efficient querying of uncertain graphs(Maniu, Cheng, and Senellart 2014)
with
06
5
60
2
06
4
3
4
26
1
61
3: 0.144: 0.01
2: 0.183: 0.01
1: 0.75
1: 0.752: 0.06
1: 0.75
1: 0.75
1: 0.5
1: 0.75
1: 0.25
1: 0.75
1: 0.5 1: 0.5
1: 1
(α)
(β)
(γ)
(ε)
(δ)
(ζ)
Problem: Optimize query evaluationon probabilistic graphs
Challenge: Probabilistic queryevaluation is hard, and standardindexing techniques for large graphsdo not work
Methodology: Build a treedecomposition that preservesprobabilities and run the query onthis tree decomposition
16 / 24 IC2 Pierre Senellart
Uniform sampling of XML documents(Tang, Amarilli, Senellart, and Bressan 2014)
with
0n
1n2n 3n
5n4n
0n
1n
2n 3n
5n4n
6n
0n
1n
3n 4n
6n5n
2n
Problem: Sample a subtree of fixedsize/characteristics uniformly atrandom from a tree, e.g., for datapricing reasons
Challenge: Naive top-down samplingdoes not work, will result in biasedsampling depending on the treestructure
Methodology: Bottom-up annotationof the tree recording distributioninformation, followed by top-downsampling
17 / 24 IC2 Pierre Senellart
Truth discovery on heterogeneous data
with
Problem: Determine true values fromintegrated semi-structured Websources
Challenge: Semi-structured data onthe Web is contradictory and copiedfrom source to source
Methodology: Estimate the truth anddetermine copy patterns betweensources, not only of base facts but ofsubtrees of the data
18 / 24 IC2 Pierre Senellart
Probabilistic XML conditioning
with
0
1
2 3 4
1e
2e 3e 4e
0
1
2 3 4
1a
2a 32 aa 32 aa
0e 0a
0
1
2
1e
2e
3e
0e
3 4
5
4e
5e
0
1
2
3 4
5
2a
4a
0a
1a
2a
0
1
2 3
1e
2e3e
0e
4
5
4e
0
1
2 3 32 aa
4
5
42 aa
0a
1a
2a
2a
5e 2a
0
1
2 3
1e
2e3e
0e
4
5
4e
0
1
2 3 ')( 332 aaa
4
5
')( 442 aaa
0a
1a
2a
5e '52 aa
0
1 3 51e3e 5e
0e
2 4 62e 4e6e
0
1 2
3 4
0e
1e 2e
3e 4e
label(0)=Rlabel(1)=Alabel(2)=Blabel(3)=Clabel(4)=D
Problem: Incorporate a logicalconstraint into a probabilistic XMLdatabase
Challenge: Constraining is not astandard probabilistic databaseoperation, NP-hard in general
Methodology: Identify tractablesubcases
19 / 24 IC2 Pierre Senellart
Provenance and Order(Amarilli, Ba, Deutch, and Senellart 2013)
with
Problem: Clean semantics fornondeterministic order-aware queries
Challenge: Existing datamanipulation languages treat order inan ad hoc manner, with nocompositional semantics
Methodology: Order-aware typesystem for strict static checking, andindividual data-level provenanceannotations for dynamic analysis ofallowed ordered operations
20 / 24 IC2 Pierre Senellart
Plan
Introduction
Instances of UnSAID
Uncertainty and Structure
UnSAID Applications
Conclusion
21 / 24 IC2 Pierre Senellart
Social Sensing of Moving Objects(Ba, Montenez, Abdessalem, and Senellart 2014)
with
Problem: Infer trajectories andmeta-information of moving objectsfrom Web and social Web data
Challenge: Uncertainty andinconsistency in extracted information
Methodology: Data cleaning byfiltering incorrect locations, and truthdiscovery to identify reliable sources
22 / 24 IC2 Pierre Senellart
Smarter urban mobility
with
Problem: Smart and adaptiverecommendations for mobility incities (transit, bike rental, car, etc.)
Challenge: Should take into accountpersonal information (calendar, etc.),past trajectories, public informationabout transit and traffic
Methodology: Map GPS tracks toroutes of public transport, learn routepatterns, infer destination while intransit and provide push suggestions
23 / 24 IC2 Pierre Senellart
Plan
Introduction
Instances of UnSAID
Uncertainty and Structure
UnSAID Applications
Conclusion
24 / 24 IC2 Pierre Senellart
What’s next?
So far, we have tackled individual aspects or specializations of theUnSAID problem
Now we need to consider the general problem, and propose generalsolutions
There is a strong potential for uncovering the unsaid informationfrom the Web
Strong connections with a number of research areas: activelearning, reinforcement learning, adaptive query evaluation, etc.Inspiration to get from these areas.
Everyone is welcome to join the effort!
Merci.
24 / 24 IC2 Pierre Senellart
What’s next?
So far, we have tackled individual aspects or specializations of theUnSAID problem
Now we need to consider the general problem, and propose generalsolutions
There is a strong potential for uncovering the unsaid informationfrom the Web
Strong connections with a number of research areas: activelearning, reinforcement learning, adaptive query evaluation, etc.Inspiration to get from these areas.
Everyone is welcome to join the effort!
Merci.
Amarilli, A., M. L. Ba, D. Deutch, and P. Senellart (Dec. 2013).Provenance for Nondeterministic Order-Aware Queries.Preprint available athttp://pierre.senellart.com/publications/amarilli2014provenance.pdf.
Ba, M. L., S. Montenez, T. Abdessalem, and P. Senellart (Apr. 2014).Extracting, Integrating, and Visualizing Uncertain WebInformation about Moving Objects. Preprint available athttp://pierre.senellart.com/publications/ba2014extracting.pdf.
Gouriten, G., S. Maniu, and P. Senellart (Mar. 2014). Scalable,Generic, and Adaptive Systems for Focused Crawling. Preprintavailable athttp://pierre.senellart.com/publications/gouriten2014scalable.pdf.
Maniu, S., R. Cheng, and P. Senellart (Mar. 2014). ProbTree: AQuery-Efficient Representation of Probabilistic Graphs. Preprintavailable athttp://pierre.senellart.com/publications/maniu2014probtree.pdf.
Tang, R., A. Amarilli, P. Senellart, and S. Bressan (Mar. 2014). Get aSample for a Discount: Sampling-Based XML Data Pricing.Preprint available athttp://pierre.senellart.com/publications/tang2014get.pdf.