Entities, Graphs, and Crowdsourcing for better Web Search

Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland (gianlucademartini.net, exascale.info)


Page 1

Entities, Graphs, and Crowdsourcing for better Web Search

Gianluca Demartini
eXascale Infolab
University of Fribourg, Switzerland

gianlucademartini.net
exascale.info

Page 2

Gianluca Demartini

• M.Sc. at University of Udine, Italy
• Ph.D. at University of Hannover, Germany
  – Entity Retrieval
• Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research (Spain), L3S Research Center (Germany)
• Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland
• Lecturer for Social Computing in Fribourg
• Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at ESWC 2013 and ISWC 2013
• Research Interests
  – Information Retrieval, Semantic Web, Crowdsourcing

[email protected]


Page 5

Web of Data

• Freebase
  – Acquired by Google in July 2010.
  – Knowledge Graph launched in May 2012.
• Schema.org
  – Driven by major search engine companies
  – Machine-readable annotations of Web pages
• Linked Open Data
  – 31 billion triples, Sept. 2011

Page 6

Linked Open Data

Z. Kaoudi and I. Manolescu, ICDE seminar 2013

Page 7

I will talk about

• Entity Linking/Disambiguation
  – On the Web using crowdsourcing
  – For Data Integration / For Entity Type Ranking
• Ad-hoc Object Retrieval (Entity Ranking)
  – Using IR and graphs
• Crowdsourced Query Understanding

Page 8

Disclaimer

• No efficiency evaluation
  – Approaches not distributed
  – But designed to scale out
• No user studies
  – Goal: Obtain high quality data
  – Only TREC-like evaluation on effectiveness

Page 9

Entity Linking/Disambiguation

Page 10

HTML:

<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

RDFa enrichment

<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>

http://dbpedia.org/resource/Facebook

http://dbpedia.org/resource/Instagram

jase:Instagram owl:sameAs

Google

Android
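The enrichment step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ENTITY_URIS` dictionary and the `enrich_with_rdfa` function are hypothetical names, the links are hard-coded instead of coming from matchers or crowd workers, and the sketch wraps just the mention itself, whereas the slide wraps the whole sentence in the `span`.

```python
import html
import re

# Hypothetical mention -> URI map. In a real system the links would be
# produced by algorithmic matchers and crowd workers, not hard-coded.
ENTITY_URIS = {
    "Facebook": "http://dbpedia.org/resource/Facebook",
    "Instagram": "http://dbpedia.org/resource/Instagram",
}


def enrich_with_rdfa(text: str, entity_uris: dict[str, str]) -> str:
    """Escape plain text and wrap every known entity mention in RDFa markup."""
    escaped = html.escape(text, quote=False)
    if not entity_uris:
        return escaped
    # One alternation pattern, so inserted markup is never re-scanned.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in entity_uris) + r")\b"
    )

    def annotate(match: re.Match) -> str:
        label = match.group(1)
        return (
            f'<span about="{entity_uris[label]}">'
            f'<cite property="rdfs:label">{label}</cite></span>'
        )

    return pattern.sub(annotate, escaped)


print(enrich_with_rdfa("Facebook has purchased Instagram.", ENTITY_URIS))
```

A single regular-expression pass is used so that a label appearing inside an already inserted URI is not annotated a second time.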

Page 11

Crowdsourcing

• Exploit human intelligence to solve
  – Tasks simple for humans, complex for machines
  – With a large number of humans (the Crowd)
  – Small problems: micro-tasks (Amazon MTurk)
• Examples
  – Wikipedia, Image tagging
• Incentives
  – Financial, fun, visibility

Page 12

ZenCrowd

• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework

[Figure: Crowd, Algorithms, Machines]

Page 13

ZenCrowd Architecture

[Figure: ZenCrowd architecture. Input: HTML Pages. Output: HTML + RDFa Pages. Components: Entity Extractors, Algorithmic Matchers, LOD Index (Get Entity) over the LOD Open Data Cloud, Probabilistic Network, Decision Engine, Micro-Task Manager (Micro Matching Tasks, Workers Decisions), Crowdsourcing Platform]

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).

Page 14

Entity Factor Graphs

• Graph components
  – Workers, links, clicks
  – Prior probabilities
  – Link Factors
  – Constraints
• Probabilistic Inference
  – Select all links with posterior prob > τ

[Figure: example factor graph with 2 workers, 6 clicks, 3 candidate links. Worker priors: pw1(), pw2(). Workers: w1, w2. Observed variables (clicks): c11, c12, c13, c21, c22, c23. Link factors: lf1(), lf2(), lf3(). Links: l1, l2, l3. Link priors: pl1(), pl2(), pl3(). SameAs constraints: sa1-2(). Dataset Unicity constraints: u2-3()]
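To make the "select all links with posterior prob > τ" rule concrete, here is a deliberately simplified Python sketch. It is not the factor graph from the ZenCrowd paper: it assumes each worker has a single fixed reliability (strictly between 0 and 1), treats clicks as independent given the link's validity, and ignores the SameAs and unicity constraints. The names `link_posterior` and `select_links` are hypothetical.

```python
def link_posterior(prior: float, votes: dict[str, bool],
                   reliability: dict[str, float]) -> float:
    """Posterior that a link is valid, via a naive-Bayes update.

    votes maps worker -> True if the worker clicked "valid".
    reliability maps worker -> probability that the worker's click is correct
    (assumed to lie strictly between 0 and 1).
    """
    p_valid, p_invalid = prior, 1.0 - prior
    for worker, says_valid in votes.items():
        r = reliability[worker]
        p_valid *= r if says_valid else (1.0 - r)
        p_invalid *= (1.0 - r) if says_valid else r
    return p_valid / (p_valid + p_invalid)


def select_links(priors: dict[str, float],
                 clicks: dict[str, dict[str, bool]],
                 reliability: dict[str, float],
                 tau: float = 0.5) -> list[str]:
    """Keep every candidate link whose posterior exceeds the threshold tau."""
    return [
        link for link, prior in priors.items()
        if link_posterior(prior, clicks.get(link, {}), reliability) > tau
    ]


# Toy instance mirroring the slide: 2 workers, 6 clicks, 3 candidate links.
reliability = {"w1": 0.9, "w2": 0.6}
priors = {"l1": 0.5, "l2": 0.5, "l3": 0.5}
clicks = {
    "l1": {"w1": True, "w2": True},
    "l2": {"w1": False, "w2": True},
    "l3": {"w1": True, "w2": False},
}
print(select_links(priors, clicks, reliability, tau=0.5))
```

In this toy instance the more reliable worker w1 outweighs w2 whenever they disagree, which is the intuition behind weighting clicks by worker priors.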

Page 15

Experimental Evaluation

• Datasets
  – 25 news articles from
    • CNN.com (Global news)
    • NYTimes.com (Global news)
    • Washington-post.com (US local news)
    • Timesofindia.indiatimes.com (India news)
    • Swissinfo.com (Switzerland local news)
  – 40M entities (Freebase, DBPedia, Geonames, NYT)

Page 16

Worker Selection

[Figure: Worker Precision vs. Number of Tasks (0 to 500) for US Workers and IN Workers, with the Top US Worker marked]

[Figure: Precision (0.6 to 0.8) vs. Top K workers (K = 1 to 9)]
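The quantities plotted on this slide can be computed along the following lines. This is a minimal sketch, assuming gold labels are available for some tasks; `worker_precision` and `top_k_workers` are hypothetical helper names, not part of ZenCrowd.

```python
def worker_precision(answers: dict[str, dict[int, str]],
                     gold: dict[int, str]) -> dict[str, float]:
    """Fraction of gold-labelled tasks each worker answered correctly."""
    precision = {}
    for worker, given in answers.items():
        judged = [task for task in given if task in gold]
        if judged:
            correct = sum(given[task] == gold[task] for task in judged)
            precision[worker] = correct / len(judged)
    return precision


def top_k_workers(precision: dict[str, float], k: int) -> list[str]:
    """Best k workers by precision; ties are broken by worker id."""
    return sorted(precision, key=lambda w: (-precision[w], w))[:k]


# Toy data: three workers, two tasks with known answers.
gold = {1: "x", 2: "y"}
answers = {
    "a": {1: "x", 2: "y"},
    "b": {1: "x", 2: "z"},
    "c": {1: "q", 2: "z"},
}
print(top_k_workers(worker_precision(answers, gold), k=2))
```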

Page 17

Lessons Learnt

• Crowdsourcing + probabilistic reasoning works!
• But
  – Different worker communities perform differently
  – Many low quality workers
  – Completion time may vary (based on reward)
• Need to find the right workers for your task (see WWW13 paper)

Page 18

Pick-A-Crowd

Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudré-Mauroux. Pick-A-Crowd: Tell Me What You Like, and I'll Tell You What to Do. In: WWW2013

Page 19

ZenCrowd Summary

• ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic
• 4% - 35% improvement over standard crowdsourcing
• 14% average improvement over automatic approaches

http://exascale.info/zencrowd/

Page 20

On top of Entity Linking

• Follow-up works:
  – Also used for instance matching across datasets
  – Ranking Entity Types: see TRank at ISWC13, nominated for best paper award

Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, and Karl Aberer. TRank: Ranking Entity Types Using the Web of Data. In: The 12th International Semantic Web Conference (ISWC 2013). Sydney, Australia, October 2013.

[Figure: TRank pipeline. Text extraction (BoilerPipe) → Named Entity Recognition (Stanford NER) → List of entity labels → Entity linking (inverted index: DBpedia labels ⟹ resource URIs) → List of entity URIs → foreach: Type retrieval (inverted index: resource URIs ⟹ type URIs) → List of type URIs → Type ranking → Ranked list of types]
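The last three stages of the pipeline in the figure (entity linking, type retrieval, type ranking) can be sketched as follows. Everything here is assumed for illustration: the two dictionaries stand in for the inverted indices, the URIs and types are toy data, and the ranking heuristic (prefer types that are frequent among the entities of the same document) is only a stand-in, not necessarily one of the ranking functions evaluated in the TRank paper.

```python
from collections import Counter

# Toy stand-ins for the two inverted indices (hypothetical data).
LABEL_TO_URI = {"facebook": "dbr:Facebook", "instagram": "dbr:Instagram"}
URI_TO_TYPES = {
    "dbr:Facebook": ["dbo:Website", "dbo:Company"],
    "dbr:Instagram": ["dbo:Software", "dbo:Company"],
}


def link_entities(labels: list[str]) -> list[str]:
    """Entity linking: map recognised labels to resource URIs, skipping unknowns."""
    return [LABEL_TO_URI[l.lower()] for l in labels if l.lower() in LABEL_TO_URI]


def rank_types(entity_uris: list[str]) -> dict[str, list[str]]:
    """For each entity, rank its types by frequency within the same document."""
    doc_counts = Counter(
        t for uri in entity_uris for t in URI_TO_TYPES.get(uri, [])
    )
    return {
        uri: sorted(URI_TO_TYPES.get(uri, []), key=lambda t: (-doc_counts[t], t))
        for uri in entity_uris
    }


uris = link_entities(["Facebook", "Instagram", "Monday"])
print(rank_types(uris))
```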

Page 21

Three-stage blocking with the Crowd for Data Integration

1. Cheap clustering/inverted index selection of candidates
2. Expensive similarity measure
3. Crowdsource low confidence matches

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-Scale Linked Data Integration Using Probabilistic Reasoning and Crowdsourcing. In: VLDB Journal, Volume 22, Issue 5 (2013), Pages 665-687, Special issue on Structured, Social and Crowd-sourced Data on the Web. October 2013.
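The three stages can be illustrated with a small Python sketch. The assumptions are all mine: candidates are selected by shared tokens, `difflib.SequenceMatcher` plays the role of the "expensive" similarity measure, and the `accept` and `reject` thresholds are arbitrary. Pairs scoring between the two thresholds are the low-confidence matches that would be sent to the crowd.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def tokens(label: str) -> set[str]:
    return set(label.lower().split())


def build_index(records: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: token -> ids of records whose label contains it."""
    index = defaultdict(set)
    for rid, label in records.items():
        for tok in tokens(label):
            index[tok].add(rid)
    return index


def match(source_label: str, targets: dict[str, str], index,
          accept: float = 0.9, reject: float = 0.5):
    """Return (automatic matches, matches to crowdsource) for one source label."""
    # Stage 1: cheap candidate selection through the inverted index.
    candidates = set().union(*(index.get(t, set()) for t in tokens(source_label)))
    accepted, to_crowd = [], []
    for rid in sorted(candidates):
        # Stage 2: more expensive string similarity, only on candidates.
        score = SequenceMatcher(
            None, source_label.lower(), targets[rid].lower()
        ).ratio()
        if score >= accept:
            accepted.append(rid)
        elif score >= reject:
            # Stage 3: low-confidence pairs go to crowd workers.
            to_crowd.append(rid)
    return accepted, to_crowd


targets = {
    "t1": "Paris",
    "t2": "Paris France",
    "t3": "Berlin",
    "t4": "Paris Hilton Hotel Las Vegas",
}
print(match("paris", targets, build_index(targets)))
```

Here "Berlin" is never compared at all (stage 1 filters it out), which is the point of blocking: the expensive measure and the crowd only see a small candidate set.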

Page 22

Entity Ranking

Page 23

Ad-hoc Object Retrieval

• Once entities have been identified…
• We want to rank them as answers to a query
• AOR
  – Given the description of an entity
  – give me back its identifier
  – Input: query q, data graph G
  – Output: ranked list of URIs from G

Page 24

A Hybrid Approach to AOR

Alberto Tonon, Gianluca Demartini, and Philippe Cudré-Mauroux. Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval. In: 35th Annual ACM SIGIR Conference (SIGIR 2012).

[Figure: hybrid AOR architecture. index(): LOD Cloud, Inverted Index, Structured Inverted Index, RDF Store. query(): User, Keyword Query, Query Annotation and Expansion (WordNet, 3rd party search engines, Pseudo-Relevance Feedback), Entity Search, Ranking Functions, intermediate top-k results, Graph Traversals (queries on object properties), Neighborhoods (queries on datatype properties), Graph-Enriched Results, Final Ranking Function]
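A toy version of the two-step idea in the figure (keyword search on an inverted index, then enrichment of the top-k results by following graph edges) might look like this in Python. It is a sketch under heavy assumptions: the triples and the `ex:` namespace are invented, term overlap replaces a real ranking function such as BM25, and the `damping` weight for neighbours is arbitrary.

```python
from collections import Counter, defaultdict

# Hypothetical data graph as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:Paris", "rdfs:label", "Paris"),
    ("ex:Paris", "ex:capitalOf", "ex:France"),
    ("ex:France", "rdfs:label", "France"),
    ("ex:Paris_Texas", "rdfs:label", "Paris Texas"),
]


def is_uri(value: str) -> bool:
    return value.startswith("ex:")


def build_index(triples):
    """Inverted index over literal values: term -> Counter of subject URIs."""
    index = defaultdict(Counter)
    for s, _, o in triples:
        if not is_uri(o):
            for term in o.lower().split():
                index[term][s] += 1
    return index


def keyword_search(index, query: str, k: int = 10):
    """Intermediate top-k results by term overlap (a stand-in for BM25)."""
    scores = Counter()
    for term in query.lower().split():
        scores.update(index.get(term, {}))
    return scores.most_common(k)


def graph_enrich(triples, top_k, damping: float = 0.5):
    """Add one-hop neighbours of the top-k entities with a damped score."""
    scores = Counter(dict(top_k))
    for uri, score in top_k:
        for s, _, o in triples:
            if s == uri and is_uri(o):
                scores[o] += damping * score
            elif o == uri:
                scores[s] += damping * score
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


results = graph_enrich(TRIPLES, keyword_search(build_index(TRIPLES), "paris"))
print(results)
```

The entity `ex:France` never matches the keyword "paris" but still enters the ranking through the graph step, which is what "Graph-Enriched Results" refers to in the figure.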

Page 25

AOR Evaluation

• 1.3 billion RDF triples from LOD cloud
• 92 and 50 queries
• Crowdsourced relevance judgments
• semsearch.yahoo.com

Page 26

Evaluation Results

Page 27

Summary

• AOR = "Given the description of an entity, give me back its identifier"
• Combining classic IR techniques + structured database storing graph data
• Significantly better results (up to +25% MAP over BM25 baseline).
• Overhead caused by the graph traversal part is limited

http://exascale.info/AOR/
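For readers unfamiliar with the metric behind the "+25% MAP" figure, the following sketch shows how Mean Average Precision and a relative improvement are usually computed. The rankings and relevance sets are made-up toy data, and the function names are hypothetical; the numbers have nothing to do with the reported results.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of precision@rank taken at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """MAP over (ranking, relevant set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def relative_improvement(new: float, baseline: float) -> float:
    return (new - baseline) / baseline


runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y"], {"y"}),                 # AP = 1/2
]
print(mean_average_precision(runs))
```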

Page 28

CrowdQ: Crowdsourced Query Understanding

Page 29

birthdate of mayor of capital city of france

Page 30

capital city of france

Page 31

mayor of paris

Page 32

birthdate of Bertrand Delanoë
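The rewriting shown on pages 29 to 32 resolves the query from the innermost part outwards. A minimal Python sketch of that chain, assuming the query has the simple form "A of B of C" and that a lookup table (`FACTS`, hypothetical and holding only the two facts visible on the slides) answers each step:

```python
# Only the facts that appear on the slides; the final birthdate is left unresolved.
FACTS = {
    ("capital city", "france"): "paris",
    ("mayor", "paris"): "Bertrand Delanoë",
}


def rewrite_steps(query: str, facts: dict[tuple[str, str], str]) -> list[str]:
    """List the sub-queries issued while resolving a nested 'X of Y of Z' query."""
    *relations, entity = query.split(" of ")
    steps = []
    for relation in reversed(relations):
        steps.append(f"{relation} of {entity}")
        answer = facts.get((relation, entity))
        if answer is None:
            break  # this step still has to be answered elsewhere
        entity = answer
    return steps


for step in rewrite_steps("birthdate of mayor of capital city of france", FACTS):
    print(step)
```

Splitting on the literal string " of " is of course far too naive for real queries, which is part of the motivation for asking the crowd to help interpret them.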

Page 33

Motivation

• Web Search Engines can answer simple factual queries directly on the result page
• Users with complex information needs are often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with Crowdsourcing!

Page 34

CrowdQ

• CrowdQ is the first system that uses crowdsourcing to
  – Understand the intended meaning
  – Build a structured query template
  – Answer the query over Linked Open Data

Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).

Page 35

Hybrid Human-Machine Pipeline

Q = birthdate of actors of forrest gump

• Query annotation: Noun, Noun, Named entity
• Verification: Is forrest gump this entity in the query?
• Entity Relations: Which is the relation between: actors and forrest gump → starring
• Schema element: Starring <dbpedia-owl:starring>
• Verification: Is the relation between: Indiana Jones – Harrison Ford, Back to the Future – Michael J. Fox of the same type as Forrest Gump - actors

Page 36

Structured query generation

Q = birthdate of actors of forrest gump

SELECT ?y ?x WHERE {
  ?y <dbpedia-owl:birthdate> ?x .
  ?z <dbpedia-owl:starring> ?y .
  ?z <rdfs:label> 'Forrest Gump' }

Results from BTC09:

[Figure labels: MOVIE, MOVIE]
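Once the crowd has verified the entity and the relation, turning the template into a query string is mechanical. The sketch below assumes a hypothetical helper `build_sparql` and reproduces the property terms exactly as written on the slide; it does not check them against the actual DBpedia vocabulary, and the quote escaping is a simplification.

```python
def build_sparql(select_vars: list[str],
                 patterns: list[tuple[str, str, str]]) -> str:
    """Assemble a SELECT query from variables and (s, p, o) triple patterns."""
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in patterns)
    return f"SELECT {' '.join(select_vars)} WHERE {{\n{body}\n}}"


def actors_birthdate_query(movie_label: str) -> str:
    """Instantiate the 'birthdate of actors of <movie>' template from the slide."""
    # Simplified escaping so the label stays a valid single-quoted literal.
    safe = movie_label.replace("\\", "\\\\").replace("'", "\\'")
    return build_sparql(
        ["?y", "?x"],
        [
            ("?y", "<dbpedia-owl:birthdate>", "?x"),
            ("?z", "<dbpedia-owl:starring>", "?y"),
            ("?z", "<rdfs:label>", f"'{safe}'"),
        ],
    )


print(actors_birthdate_query("Forrest Gump"))
```

Keeping the template separate from the movie label means the same crowd-verified structure can be reused for other queries of the same shape.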

Page 37

Conclusions

• Structured Data make Web Search better
• Exploit the best out of structured and unstructured data (Hybrid AOR)
• Crowd can help in understanding semantics
• Hybrid human-machine systems (ZenCrowd)
• Exploit Human Intelligence at Scale (CrowdQ)

gianlucademartini.net
[email protected]