edbt 2010, belhajjame

Feedback-Based Annotation, Selection and Refinement of Schema Mappings for Dataspaces

Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler

EDBT/ICDT 2010

Upload: khalid-belhajjame

Post on 06-May-2015

DESCRIPTION

A talk given at the EDBT/ICDT 2010 conference. For more details, visit the project website at http://img.cs.manchester.ac.uk/dataspaces/dataspaces.html

TRANSCRIPT

Page 1: Edbt 2010, Belhajjame

Feedback-Based Annotation, Selection and Refinement of Schema Mappings for Dataspaces

Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler

1  EDBT/ICDT  2010  

Page 2: Edbt 2010, Belhajjame

Data Integration

Scientist: What are the available proteins of the Fruit Fly?

Integration Schema

Mappings

Sources: PedroDB, PepSeeker, Pride, GPMDB

Page 3: Edbt 2010, Belhajjame

Towards Pay-as-you-go Data Integration

• Data Integration

– Setting up a data integration system requires significant upfront effort.

– The specification of schema mappings has proved to be time and resource consuming: it requires deep knowledge of the sources to be integrated as well as of the user's requirements.

• Dataspaces: Pay-as-you-go Data Integration [Franklin et al. 2005]

– Reduce the upfront cost required to set up a data integration system: provide some services immediately.

– Gradually improve the services provided by the system through interaction with end users in a pay-as-you-go fashion.

M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.

Page 4: Edbt 2010, Belhajjame

Pay-as-you-go Data Integration

Scientist: What are the available proteins of the Fruit Fly?

Integration Schema

Mappings

Sources: PedroDB, PepSeeker, Pride, GPMDB

Bootstrap Dataspaces

Objective of the present work: investigate pay-as-you-go annotation, selection, and refinement of schema mappings.

Page 5: Edbt 2010, Belhajjame

Pay-as-you-go Data Integration

• We assume that the integration schema and the source schemas are relational, and that the schema mappings defining the extent of a relation r in the integration schema are global-as-view mappings of the form:

m = ⟨r, qs⟩, where qs is a relational query over the source schemas.

• A relation in the integration schema can be associated with multiple candidate mappings: we consider a setting in which multiple matching mechanisms can be used, each of which could give rise to multiple candidate mappings for populating the same relation of the integration schema.
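A mapping of the form m = ⟨r, qs⟩ can be pictured as a pair of a relation name and a query function. The following is only a minimal sketch: the `Mapping` class, the in-memory `sources` dictionary, the `PeptideHit` relation and all sample rows are hypothetical illustrations, not part of the system presented in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Tuple[str, ...]
Sources = Dict[str, List[Row]]  # source relation name -> rows (assumed in-memory)


@dataclass(frozen=True)
class Mapping:
    """A global-as-view mapping m = <r, qs> (illustrative representation)."""
    r: str                                   # relation of the integration schema
    qs: Callable[[Sources], FrozenSet[Row]]  # query over the source relations

    def extent(self, sources: Sources) -> FrozenSet[Row]:
        """Evaluate the source query to populate relation r."""
        return self.qs(sources)


# Hypothetical source data: (accession, species).
sources = {
    "ProteinEntry": [("P1", "fruit fly"), ("P2", "mouse")],
    "PeptideHit": [("P3", "fruit fly")],
}

# Two candidate mappings that populate the same integration relation.
candidates = [
    Mapping("Protein", lambda s: frozenset(s["ProteinEntry"])),
    Mapping("Protein",
            lambda s: frozenset(s["ProteinEntry"]) | frozenset(s["PeptideHit"])),
]

extents = [m.extent(sources) for m in candidates]
```

Both candidates target `Protein` but return different extents, which is the situation that annotation and selection have to deal with.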

Page 6: Edbt 2010, Belhajjame

Outline

• User Feedback

• Annotation of Schema Mappings

• Selection of Schema Mappings Based on User Requirements

• Refinement of Schema Mappings

Page 7: Edbt 2010, Belhajjame

User Feedback

• Query: What are the available fruit fly proteins?

• Results: [table of result tuples not preserved]

Feedback on the result tuples: ✔ ✖ ✖ ✔

Page 8: Edbt 2010, Belhajjame

User Feedback (cont.)

• Let m be a candidate mapping, and UF a set of feedback instances supplied by the user:

• tp(m,UF): the tuples that are expected by the user and are retrieved by the mapping m.

• fp(m,UF): the tuples that are not expected by the user and are retrieved by the mapping m.

• fn(m,UF): the tuples that are expected by the user and are not retrieved by the mapping m.
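The three sets above can be computed directly from a mapping's result and the feedback. This is a minimal sketch, assuming feedback is stored as a dictionary from a tuple to a boolean (True for "expected", False for "not expected"); that encoding and the sample values are assumptions made for illustration.

```python
def tp(result, uf):
    """Tuples retrieved by the mapping that the user marked as expected."""
    return {t for t in result if uf.get(t) is True}


def fp(result, uf):
    """Tuples retrieved by the mapping that the user marked as not expected."""
    return {t for t in result if uf.get(t) is False}


def fn(result, uf):
    """Tuples the user expects that the mapping does not retrieve."""
    return {t for t, expected in uf.items() if expected and t not in result}


# Hypothetical result of a mapping m and feedback UF.
result = {"P1", "P2", "P3"}
uf = {"P1": True, "P2": False, "P4": True}

true_positives = tp(result, uf)
false_positives = fp(result, uf)
false_negatives = fn(result, uf)
```

Note that `P3` is retrieved but has no feedback, so it falls into none of the three sets: the annotation is relative to the feedback collected so far.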

Page 9: Edbt 2010, Belhajjame

Outline

• User Feedback

• Annotation of Schema Mappings

• Selection of Schema Mappings Based on User Requirements

• Refinement of Schema Mappings

Page 10: Edbt 2010, Belhajjame

Annotating Mappings

Using a simple annotation scheme, a schema mapping can be annotated as:

• Correct

• Incorrect

The set of schema mappings is likely to be incomplete, and, therefore, we may end up annotating all mappings as incorrect.

Because of this, we use a less stringent mapping annotation scheme.

Page 11: Edbt 2010, Belhajjame

Annotating Mappings (cont.)

Instead, we use and adapt the notions of precision and recall used in information retrieval to measure the quality of a mapping.

• Precision:

• Recall:

• F measure:
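The formulas for these three measures are not reproduced in this transcript. Assuming they follow the standard information-retrieval definitions, restricted to the tuples covered by the feedback UF (the sets tp, fp and fn defined earlier), they would read as follows. The balanced form of the F measure shown here is an assumption; a weighted variant is equally possible.

```latex
\begin{align*}
\mathrm{precision}(m, UF) &= \frac{|tp(m,UF)|}{|tp(m,UF)| + |fp(m,UF)|} \\[4pt]
\mathrm{recall}(m, UF) &= \frac{|tp(m,UF)|}{|tp(m,UF)| + |fn(m,UF)|} \\[4pt]
F(m, UF) &= \frac{2 \cdot \mathrm{precision}(m,UF) \cdot \mathrm{recall}(m,UF)}
                 {\mathrm{precision}(m,UF) + \mathrm{recall}(m,UF)}
\end{align*}
```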

Page 12: Edbt 2010, Belhajjame

Mapping Annotation: Validation

Questions:

– How much user feedback is required for approximating the real precision and recall, i.e., those based on complete knowledge of the expected results?

– Does the pay-as-you-go philosophy hold?

Page 13: Edbt 2010, Belhajjame

Mapping Annotation: Validation (cont.)

Experiment:

• Data:

– Two datasets: the Mondial geographical database and the Amalgam data integration benchmark.

– Candidate schema mappings: created using IBM InfoSphere Data Architect.

• Process: we applied the two-step process below for multiple iterations.

1. Generate a sample of feedback instances.

2. Compute the relative precision and recall of the candidate mappings given the cumulative feedback.
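The two-step process can be simulated on synthetic data. The sketch below is not the experimental setup of the talk: the tuple universe, the expected set, the mapping result, the batch size and the uniform random sampling of feedback are all assumptions chosen to keep the example small.

```python
import random


def precision_recall(result, uf):
    """Precision and recall of `result` relative to feedback `uf` (tuple -> bool)."""
    tp = sum(1 for t in result if uf.get(t) is True)
    fp = sum(1 for t in result if uf.get(t) is False)
    fn = sum(1 for t, wanted in uf.items() if wanted and t not in result)
    precision = tp / (tp + fp) if tp + fp else None  # undefined without evidence
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


def simulate(result, expected, universe, batch, seed=0):
    """Return (feedback size, precision error, recall error) per iteration."""
    full = {t: t in expected for t in universe}  # complete knowledge
    true_p, true_r = precision_recall(result, full)
    order = list(universe)
    random.Random(seed).shuffle(order)
    uf, errors = {}, []
    for start in range(0, len(order), batch):
        # Step 1: generate a sample of feedback instances.
        for t in order[start:start + batch]:
            uf[t] = t in expected
        # Step 2: recompute relative precision and recall on cumulative feedback.
        p, r = precision_recall(result, uf)
        errors.append((len(uf),
                       None if p is None else abs(p - true_p),
                       None if r is None else abs(r - true_r)))
    return errors


universe = range(100)            # hypothetical tuple identifiers
expected = set(range(0, 60))     # tuples the user expects
result = set(range(30, 80))      # tuples retrieved by the mapping
errors = simulate(result, expected, universe, batch=20)
```

Once the feedback covers the whole universe, the relative measures coincide with the real ones, so the last entry of `errors` has zero error for both precision and recall.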

Page 14: Edbt 2010, Belhajjame

Mapping Annotation: Error in Precision

[Plot; axis label: Error]

Page 15: Edbt 2010, Belhajjame

Mapping Annotation: Error in Recall

[Plot; axis label: Error]

Page 16: Edbt 2010, Belhajjame

Outline

• User Feedback

• Annotation of Schema Mappings

• Selection of Schema Mappings Based on User Requirements

• Refinement of Schema Mappings

Page 17: Edbt 2010, Belhajjame

Mapping Selection

• Mapping selection should be tailored to meet user requirements.

• We use a selection method that aims to maximise the recall such that the precision of the results is higher than a given precision threshold.

• We cast this selection problem as a search problem that aims to maximise the following utility function:

D. A. Menascé and V. Dubey. Utility-based QoS brokering in service oriented architectures. In ICWS, pages 422–430. IEEE CS, 2007.
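The selection goal (maximise recall subject to a precision threshold) can be illustrated with an exhaustive search over subsets of candidate mappings. This sketch does not use the utility function of the talk, which is not reproduced in this transcript: it replaces it with a hard precision constraint, and it assumes that the result of a set of mappings is the union of their individual results. The mapping names, extents and feedback are hypothetical.

```python
from itertools import combinations


def precision_recall(result, uf):
    """Precision and recall of `result` relative to feedback `uf` (tuple -> bool)."""
    tp = sum(1 for t in result if uf.get(t) is True)
    fp = sum(1 for t in result if uf.get(t) is False)
    fn = sum(1 for t, wanted in uf.items() if wanted and t not in result)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def select_mappings(extents, uf, threshold):
    """Pick the subset of mappings with the highest recall whose precision
    is at least `threshold`. Exhaustive, so only suitable for few candidates."""
    best, best_recall = (), -1.0
    names = sorted(extents)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            result = set().union(*(extents[n] for n in subset))
            p, r = precision_recall(result, uf)
            if p >= threshold and r > best_recall:
                best, best_recall = subset, r
    return best, best_recall


# Hypothetical feedback and candidate-mapping results.
uf = {"a": True, "b": True, "c": True, "d": False, "e": False}
extents = {"m1": {"a", "b"}, "m2": {"c", "d"}, "m3": {"a", "d", "e"}}

selected, recall = select_mappings(extents, uf, threshold=0.7)
```

Here `m2` alone fails the threshold, yet adding it to `m1` raises recall while keeping the combined precision at 0.75, so the pair is selected. A search that only considered mappings in isolation would miss this.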

Page 19: Edbt 2010, Belhajjame

Mapping Selection: Precision

Do we meet the precision requirement, i.e., is the precision threshold set by the user respected?

Page 20: Edbt 2010, Belhajjame

Mapping Selection: Precision

Page 21: Edbt 2010, Belhajjame

Mapping Selection: Recall

Do we get some benefits for recall, i.e., does the method we use maximise the recall?

Page 22: Edbt 2010, Belhajjame

Mapping Selection: Recall

Page 23: Edbt 2010, Belhajjame

Outline

• User Feedback

• Annotation of Schema Mappings

• Selection of Schema Mappings Based on User Requirements

• Refinement of Schema Mappings

Page 24: Edbt 2010, Belhajjame

Mapping Refinement

• We distinguish two kinds of refinement:

• Mapping refinement that seeks to reduce the number of false positives

– A candidate mapping m is refined by modifying its source query so that the number of false positives it returns is reduced.

• Mapping refinement that aims to increase the number of true positives

– A candidate mapping m is refined by modifying its source query so that the number of true positives it returns is increased.

Page 25: Edbt 2010, Belhajjame

Mapping Refinement: Example

User requirement: "I want fruit fly proteins"

Figure labels: Integration schema; Source schema; Protein; Accession, name, gene

m = ⟨Protein, ProteinEntry⟩

Page 26: Edbt 2010, Belhajjame

15/04/2009   Khalid   26  

Mapping Refinement: The Space of Solutions

The space of solutions is composed of the mappings that can be constructed out of the candidate mappings, specifically by:

i. Joining the source query of a candidate mapping.

ii. Augmenting the source query of a candidate mapping with a selection condition.

iii. Relaxing the selection condition of the source query of a candidate mapping.

iv. Combining the source queries of two or more mappings using union, difference and intersection.
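Operators ii to iv can be sketched over a toy query representation. Everything here is an illustrative assumption: a query is modelled as a set of materialised rows plus a tuple of selection predicates, the function names are hypothetical, and the join operator (i) is left out because the transcript does not say what the source query is joined with.

```python
from dataclasses import dataclass, replace
from typing import Callable, FrozenSet, Tuple

Row = Tuple[str, ...]


@dataclass(frozen=True)
class Query:
    """Toy source query: base rows filtered by a conjunction of conditions."""
    rows: FrozenSet[Row]
    conditions: Tuple[Callable[[Row], bool], ...] = ()

    def evaluate(self) -> FrozenSet[Row]:
        return frozenset(r for r in self.rows
                         if all(cond(r) for cond in self.conditions))


def augment(q: Query, cond: Callable[[Row], bool]) -> Query:
    """(ii) Add a selection condition; can only remove tuples."""
    return replace(q, conditions=q.conditions + (cond,))


def relax(q: Query, index: int) -> Query:
    """(iii) Drop one selection condition; can only add tuples."""
    return replace(q, conditions=q.conditions[:index] + q.conditions[index + 1:])


def union(q1: Query, q2: Query) -> Query:
    """(iv) Combine two source queries by union."""
    return Query(q1.evaluate() | q2.evaluate())


def difference(q1: Query, q2: Query) -> Query:
    """(iv) Combine two source queries by difference."""
    return Query(q1.evaluate() - q2.evaluate())


def intersection(q1: Query, q2: Query) -> Query:
    """(iv) Combine two source queries by intersection."""
    return Query(q1.evaluate() & q2.evaluate())


# Hypothetical rows: (accession, species).
protein_entry = Query(frozenset({("P1", "fruit fly"), ("P2", "mouse")}))
other_source = Query(frozenset({("P3", "fruit fly"), ("P2", "mouse")}))

fly_only = augment(protein_entry, lambda row: row[1] == "fruit fly")
relaxed = relax(fly_only, 0)
combined = union(fly_only, other_source)
only_in_other = difference(other_source, protein_entry)
```

Augmenting targets false positives (the mouse protein disappears from `fly_only`), while relaxing and union target missing true positives, which matches the two kinds of refinement distinguished earlier.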

Page 27: Edbt 2010, Belhajjame

Exploring the Space of Solutions

• The space of mappings that can be obtained by refinement is potentially large.

• A search algorithm that explores the whole space of the possible mappings may not be able to find a solution in a bounded time.

• In the context of the present work, we used an evolutionary algorithm for exploring the space of mappings that can be obtained by refinement.
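To give a feel for evolutionary exploration, the sketch below runs a generic elitist loop over refinements of a single mapping. It is not the refinement algorithm of the talk, whose details are not reproduced in this transcript. The assumptions are: an individual is a set of named selection conditions applied to a fixed set of rows, the only mutation is toggling one condition, and fitness is the balanced F measure relative to the feedback. All names and data are hypothetical.

```python
import random


def f_measure(result, uf):
    """Balanced F measure of `result` relative to feedback `uf` (tuple -> bool)."""
    tp = sum(1 for t in result if uf.get(t) is True)
    fp = sum(1 for t in result if uf.get(t) is False)
    fn = sum(1 for t, wanted in uf.items() if wanted and t not in result)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)


def evaluate(individual, rows, conditions):
    """Rows that satisfy every condition named in the individual."""
    return {row for row in rows
            if all(conditions[name](row) for name in individual)}


def evolve(rows, conditions, uf, generations=20, size=6, seed=0):
    """Elitist evolutionary search; returns the best individual and its fitness."""
    rng = random.Random(seed)
    names = sorted(conditions)

    def fitness(ind):
        return f_measure(evaluate(ind, rows, conditions), uf)

    population = [frozenset()]  # start from the unrefined mapping
    for _ in range(generations):
        # Mutation: toggle one randomly chosen condition, twice per parent.
        offspring = [ind ^ {rng.choice(names)}
                     for ind in population for _ in range(2)]
        # Elitism: parents compete with offspring; ties broken deterministically.
        pool = set(population) | set(offspring)
        population = sorted(pool,
                            key=lambda ind: (-fitness(ind), sorted(ind)))[:size]
    return population[0], fitness(population[0])


# Hypothetical rows: (accession, species, status), and feedback on three of them.
rows = {("P1", "fruit fly", "curated"), ("P2", "mouse", "curated"),
        ("P3", "fruit fly", "predicted"), ("P4", "fruit fly", "curated")}
uf = {("P1", "fruit fly", "curated"): True,
      ("P2", "mouse", "curated"): False,
      ("P3", "fruit fly", "predicted"): True}
conditions = {"fly": lambda row: row[1] == "fruit fly",
              "curated": lambda row: row[2] == "curated"}

best, score = evolve(rows, conditions, uf)
baseline = f_measure(evaluate(frozenset(), rows, conditions), uf)
```

Because parents are kept in the pool, the best fitness never decreases from one generation to the next, so the loop can be stopped at any time with a usable mapping. That property suits the bounded-time concern raised above.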

Page 28: Edbt 2010, Belhajjame

Mapping Refinement Algorithm

Page 29: Edbt 2010, Belhajjame

Mapping Refinement: Validation

• Question: Can mapping refinement improve the quality of initial candidate mappings, and, if so, at what cost, i.e., what is the amount of user feedback required?

• Experiment: To answer the above question we applied the following process for multiple iterations.

1) Generate a sample of feedback instances.

2) Annotate the set of candidate mappings.

3) Refine candidate mappings using the RefineMappings algorithm.

Page 30: Edbt 2010, Belhajjame

Mapping Refinement: Validation (cont.)

Page 31: Edbt 2010, Belhajjame

Conclusions

• Pay-as-you-go Annotation of Schema Mappings

– We showed how schema mappings can be incrementally annotated based on feedback supplied by end users.

– We also showed through an evaluation exercise that the more feedback the user supplies, the better the quality of the computed mapping annotations.

• Application: Selection and Refinement of Schema Mappings in Dataspaces

– Mapping annotations computed from user feedback are used as input for the selection and refinement of schema mappings.

– The evaluation exercises also showed that mapping refinement is more cost-effective in the first feedback iterations.
