KU Leuven combines/connects SAP HANA, fuzzy search, gateway services and SAP UI5 to build apps for students and staff

Nico Croes - Head of Development Coaching team

Kris Claes - Sr. Software Architect

SAPience.be TECHday ‘15

Agenda

•  Intro KU Leuven
•  The project
•  The flow

[Flow diagram: UI5, GW-Hub, GW-Client, OData model, Handler, Fuzzy Frame, ADBC Frame, ADBC classes, HANA]

KU Leuven & Association

>  102,000 students
>  22,000 staff members

[Chart: Academic programmes, Professional programmes, Arts; the only legible value is 2%]

SAP @ KU Leuven

Since 1999…

SRM (e-catalog), CRM, Solution Manager, PI, BO, BW
ca. 800 GB data

ERP: RRB, PAY-TIME, PA-PD, e-recruiting, HR-FPM, FM, CO, AA, SD, MM, FSCM, FICA, IS: SLcM, PM, PS, IM, RE-FX, FI
ca. 2 TB data

+ A lot of custom code (workflow, web applications, interfaces, …)

Central IT Department

DIRECTORATE ICTS

Customer & Service Centre
•  Customer & service managers
•  IT Vendor & Purchasing mgmt
•  ICTS Helpdesk, communication & training

Facilities for Education, Research, Communication and Collaboration
•  Inter- and Intranet
•  Facilities for Education
•  Facilities for Research
•  Communication & Collaboration
•  Competence Centre Information Security

Administrative Applications for General Management
•  Finance
•  Logistics
•  Human Resources
•  SAP Basis & ICTS

Administrative Applications for University management
•  Students
•  Education
•  Individual Study programmes & Exams
•  Research
•  CRM

Local Network & Support
•  Local Infrastructure System administration
•  Local Infrastructure support
•  PC Classrooms support

Central IT Infrastructure
•  System administration AIX
•  System administration UNIX
•  System administration Windows
•  Data Centre Network

Competence Centre Management Information

ICTS Administrative Office

Total: 215 FTE

The project

Original requirement
•  Add some additional fields to the existing BSP application

Objections
•  Conflict with our internal guidelines concerning modifications of ‘old’ BSP programs
•  The existing application is slow (a search can take up to 20 s)
•  The existing application is not MVC (which makes modifications tricky)

New proposal
•  Refactor the ‘old’ BSP application into a new UI5 app
•  Use HANA to speed up the search and data collection
•  Offer one application that runs on all devices (responsive)
•  Separate the UI (frontend developer) from data processing (backend developer)

Old BSP application

Search for (bio)chemical substances by name, formula, CAS number (in SAP EH&S)
→ detailed sheet with information on risks, regulations, safety measures
→ print labels for recipients

New UI5 application

[Screenshots: exact search and fuzzy search]

The Flow

UI5 application design

Use standard UI5 UI elements & concepts
•  ‘Master detail’ design pattern
•  View drill down to details
https://sapui5.netweaver.ondemand.com/explored.html#/entity/sap.m.Table/samples


GW-Hub

•  Endpoint for the GW service
•  Location of the UI5 app
•  The GW REST service can be used by other (non-SAP) applications
https://admin.kuleuven.be/icts/services/dataservices
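
As a hedged illustration of how a non-SAP client might address such a Gateway service, the sketch below only builds an OData V2 query URL and sends nothing. The host name, the service name ZCHEM_SRV, the entity set Substances and the property Ident are hypothetical; the slides do not say how the real service exposes the search (filter, custom query option or function import).

```python
from urllib.parse import quote, urlencode

# Hypothetical values: host, service and entity set are assumptions for illustration.
BASE_URL = "https://gateway.example.org/sap/opu/odata/sap/ZCHEM_SRV"


def build_search_url(term: str, top: int = 20) -> str:
    """Build an OData V2 query URL that filters an entity set on a search term."""
    # Single quotes inside an OData string literal are escaped by doubling them.
    literal = term.replace("'", "''")
    params = {
        "$filter": f"substringof('{literal}', Ident)",
        "$top": str(top),
        "$format": "json",
    }
    # Keep "$" and the literal punctuation readable; percent-encode the rest.
    query = urlencode(params, safe="$'(),", quote_via=quote)
    return f"{BASE_URL}/Substances?{query}"


print(build_search_url("ethyleen"))
```

Any HTTP client can then issue a GET on the resulting URL, which is what makes the service reusable outside the UI5 app.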


OData model

•  Logical data model (not the DB model)
•  Link between frontend and backend (developers)

Database Model - Search

[Diagram of the tables involved: TCGTPLREL, ESTRH, ESTRI, TCG22, TCG24, TCG11, ESTVA, ESTVH, TCG53, TCG12, AUSP, ESTPH, ESTPP, ESTPJ, ESTPS, ESTPT, CABN, CABNT, KSML, KLAH, ZLT_BIG_LBL, ZLT_BIG_LBL_UNIT, ZLT_BIG_LBL_PHR, T006]

GW implementation SEGW

Data entity types

Service implementation

•  Different service implementations are ‘grouped’ into a handler/model class


Improving Existing Search Help

Performance

Only ‘exact’ search

Typos: Ethyleen, Etyleen, Ethylene, Etilien, …
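
To make the problem concrete, here is a small sketch outside SAP: an exact comparison rejects every misspelling, while a similarity measure still ranks them close to the intended term. Python's difflib ratio is used only as a stand-in; it is not the scoring that HANA applies.

```python
from difflib import SequenceMatcher

variants = ["Ethyleen", "Etyleen", "Ethylene", "Etilien"]
target = "ethyleen"


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio (stand-in measure, not HANA's fuzzy score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


for word in variants:
    exact = word.lower() == target
    print(f"{word:10} exact={exact!s:5} similarity={similarity(word, target):.2f}")
```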

What data do we have?

EH&S – hazardous substances – database
•  Approximately 750,000 records
•  Text fields with product names, synonyms, formulas
•  Some typos

Database Model - Fetch

[Diagram of the tables involved: TCGTPLREL, ESTRH, ESTRI, TCG22, TCG24, TCG11, ESTVA, ESTVH, TCG53, TCG12, AUSP, ESTPH, ESTPP, ESTPJ, ESTPS, ESTPT, CABN, CABNT, KSML, KLAH, ZLT_BIG_LBL, ZLT_BIG_LBL_UNIT, ZLT_BIG_LBL_PHR, T006]

HANA – Fuzzy Search

Fuzzy search can be used in various applications, for example:
•  Fault-tolerant search in text columns (html or pdf for example): Search for documents on 'Driethanolamyn' and find all documents that contain the term 'Triethanolamine'.
•  Fault-tolerant search in structured database content: Search for a product called 'coffe krisp biscuit' and find 'Toffee Crisp Biscuits'.
•  Fault-tolerant check for duplicate records: Before creating a new customer record in a CRM system, search for similar customer records and verify that no duplicates are already stored in the system. When creating a new record called 'SAB Aktiengesellschaft & Co KG Deutschl.' in 'Wahldorf' for example, the system would bring up 'SAP Deutschland AG & Co. KG' in 'Walldorf' as a possible duplicate.

Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA. The term "fault-tolerant search" means that a database query returns records even if the search term (the user input) contains additional or missing characters, or other types of spelling error.

You can call the fuzzy search by using the CONTAINS predicate with the FUZZY option in the WHERE clause of a SELECT statement.

Search queries with CONTAINS

... where contains( ident, 'ethyleen', FUZZY(0.5) )

... where contains( ident, 'ethyleen', LINGUISTIC )

... where contains( ident, 'ethyleen', EXACT )

“A linguistic search finds all words that have the same word stem as the search term. It also finds all words for which the search term is the word stem. In the SELECT statement of the full-text search query, you can specify the LINGUISTIC search type. When you execute a linguistic search, the system has to determine the stems of the searched terms. It will look up the stems in the stem dictionary. The hits in the stem dictionary point to all words in the word dictionary that have this stem.”

Basic Fuzzy

select score() as score, *
from z_chemical_nofuzz
where contains( product, 'ethyleen', fuzzy(0.7) )
order by score desc

select *
from z_chemical_nofuzz
where product like '%ethyleen%'
order by product;

No typos

No longer texts

Fuzzy Search on String Columns

Not all options are available on string columns!

select score() as score, *
from z_chemical_nofuzz
where contains( product, 'ethyleen', fuzzy(0.7, 'textsearch=compare') )
order by score desc

Could not execute 'select score() as score, * from z_chemical_nofuzz where contains ( product , 'ethyleen' , fuzzy( ...' in 4 ms 709 µs .
SAP DBTech JDBC: [2048]: column store error: search table error: [2018] Option 'textSearch' not allowed for column 'PRODUCT'

You can make this option available by creating a full-text index on the required columns.

String types: basic fuzzy
Text types: sophisticated fuzzy
Date types: sophisticated fuzzy

Fuzzy Search Index

•  Creation by SQL
•  Creation in the HANA Studio: offers fewer options

CREATE FULLTEXT INDEX <index_name> ON <tableref> '(' <column_name> ')' [<fulltext_parameter_list>]

Specify any of the following additional parameters for the full-text index:
LANGUAGE COLUMN <column_name>
LANGUAGE DETECTION '(' <string_literal_list> ')'
MIME TYPE COLUMN <column_name>
FUZZY SEARCH INDEX <on_off>
PHRASE INDEX RATIO <on_off>
CONFIGURATION <string_literal>
SEARCH ONLY <on_off>
FAST PREPROCESS <on_off>
TEXT MINING <on_off>
TEXT MINING CONFIGURATION <string_literal>
TEXT ANALYSIS <on_off>
MIME TYPE <specified mime type, e.g. application/pdf>
TOKEN SEPARATORS <\/;,.:-_()[]<>!?*@+{}="&>

CREATE FULLTEXT INDEX "INDENT" ON "UX000657"."Z_ESTRI" ("IDENT")
  SYNC
  PHRASE INDEX RATIO 0.200000
  FUZZY SEARCH INDEX OFF
  SEARCH ONLY ON
  FAST PREPROCESS ON
  TEXT MINING OFF
  TEXT ANALYSIS OFF
  TOKEN SEPARATORS '\/;,.:-_()[]<>!?*@+{}="&#$~|'

Basic Fuzzy - again

select score() as score, *
from z_chemical
where contains( product, 'ethyleen', fuzzy(0.7) )
order by score desc

select *
from z_chemical
where product like '%ethyleen%'
order by product;

Yes! longer texts

Still no typos

How does it work?

SELECT * FROM <tableref> WHERE CONTAINS ( (<col1>, <col2>, <col3>), <search_string> )

•  Tokenize using the token separators.
•  Similarity: defined by the number of common characters, wrong characters, additional characters in search string and reference string.
•  Standardization: translation to lower-case characters without diacritics.
   •  Possible to get a match when comparing 2 unequal terms ( “Café” = “café” ).
   •  High fuzzy scores for common differences in the spelling of words.
•  Influenced by some additional parameters
   •  excessTokenWeight
   •  andThreshold
   •  …
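
The tokenize and standardize steps listed above can be sketched outside HANA as follows. This is a minimal illustration, not HANA's implementation: the separator set is copied from the CREATE FULLTEXT INDEX example in this deck, and treating whitespace as a separator is an assumption.

```python
import re
import unicodedata

# Separator set copied from the CREATE FULLTEXT INDEX example in this deck,
# plus whitespace (an assumption: the slides do not show how blanks are handled).
TOKEN_SEPARATORS = r"""\/;,.:-_()[]<>!?*@+{}="&#$~|"""


def standardize(text: str) -> str:
    """Lower-case the text and strip diacritics (e.g. 'Café' -> 'cafe')."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Split standardized text on the separator characters and whitespace."""
    pattern = "[" + re.escape(TOKEN_SEPARATORS) + r"\s]+"
    return [tok for tok in re.split(pattern, standardize(text)) if tok]


print(tokenize("Zuiver Ethyleen (99,9%)"))
```

After these two steps, the similarity comparison works on normalized tokens, which is why differences in case or accents do not lower the score.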

Fuzzy Score

The higher the score, the more similar the strings are. A score of 1.0 means the strings are identical. A score of 0.0 means the strings have nothing in common.

You can request the score in the SELECT statement by using the SCORE() function. You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). If a fuzzy search of multiple columns is used in a SELECT statement, the score is returned as an average of the scores of all columns used.

When searching text columns, a TF/IDF (term frequency/inverse document frequency) score is returned by default instead of the fuzzy score.

The TF/IDF calculation can be disabled so that you get the fuzzy score instead. In particular, this makes sense for short-text columns containing data such as product names or company names.

New result

select distinct score() as score, ident
from z_estri
where contains( ident, 'ethyleen', fuzzy(0.6, 'textsearch=compare') )
order by score desc, ident

Our typos get top scores …
… but we lose points in the longer texts

excessTokenWeight

•  excessTokenWeight defines the weight of excess (that is, unassigned) tokens. It is set to 1.0 by default.
•  Excess tokens are tokens that do not have a counterpart token on either the input side or the request side.
•  This parameter enables a better sorting by score when the lengths (that is, the number of tokens) of the request entry and the reference entry are different.
•  We want to select the data regardless of the excess tokens.
•  But a record without excess tokens must be more exact than one with excess tokens.

excessTokenWeight

where contains( ident, 'ethyleen', fuzzy(0.6, 'textsearch=compare, excessTokenWeight=1.0') )

•  Default = 1.0

where contains( ident, 'ethyleen', fuzzy(0.6, 'textsearch=compare, excessTokenWeight=0.1') )
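
The effect of lowering the weight can be illustrated with a toy model. The formula below is an assumption made up for this sketch and is NOT the formula SAP HANA uses; it only shows why a small weight keeps longer texts above the 0.6 threshold while a record without excess tokens still scores highest.

```python
def toy_score(query_tokens: list[str], record_tokens: list[str],
              excess_token_weight: float) -> float:
    """Toy score: matched tokens count fully, excess tokens count by weight.

    Illustrative only; this is NOT the formula SAP HANA uses.
    """
    matched = len(set(query_tokens) & set(record_tokens))
    excess = len(set(query_tokens) ^ set(record_tokens))
    if matched == 0:
        return 0.0
    return matched / (matched + excess_token_weight * excess)


query = ["ethyleen"]
records = [["ethyleen"], ["ethyleen", "oxide"], ["zuiver", "ethyleen", "gas"]]

for weight in (1.0, 0.1):
    scores = [round(toy_score(query, rec, weight), 2) for rec in records]
    print(f"excessTokenWeight={weight}: {scores}")
```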

andThreshold

•  Specify a 'partial AND'
•  The 'andThreshold' parameter defines the percentage of tokens that have to match
   •  andThreshold = 1.0 → all tokens have to match, 'strict AND'
   •  0.0 < andThreshold < 1.0 → some of the tokens have to match, 'soft AND'
   •  andThreshold = 0.0 → at least one token has to match, 'OR'
•  The parameter influences performance.

andThreshold

where contains( ident, 'zuiver ethyleen', fuzzy(0.6, 'textsearch=compare, excessTokenWeight=0.1, andThreshold=1.0') )

where contains( ident, 'zuiver ethyleen', fuzzy(0.6, 'textsearch=compare, excessTokenWeight=0.1, andThreshold=0.0') )

•  Default = 1.0 = strict AND
•  0.0 = OR
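
The three andThreshold cases can be expressed as one small check. This is a sketch of the rule as the slides describe it, with assumptions: tokens are compared exactly here (HANA compares them fuzzily), and how HANA rounds the required fraction is not covered by the slides.

```python
def passes_and_threshold(query_tokens: list[str], record_tokens: list[str],
                         and_threshold: float) -> bool:
    """Check whether enough query tokens occur in the record.

    1.0 -> every token must match (strict AND)
    0.0 -> at least one token must match (OR)
    Tokens are compared exactly here; HANA would compare them fuzzily.
    """
    record = set(record_tokens)
    matched = sum(1 for tok in query_tokens if tok in record)
    if matched == 0:
        return False
    return matched / len(query_tokens) >= and_threshold


query = ["zuiver", "ethyleen"]
record = ["ethyleen", "oxide"]

for threshold in (1.0, 0.5, 0.0):
    print(threshold, passes_and_threshold(query, record, threshold))
```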

(de)composeWords

•  composeWords: how words in the user input are combined into compound words
•  decomposeWords: how words in the user input are split into separate words, building a decomposition phrase
•  compoundWordWeight: how compound word hits affect the score of a document
•  These parameters influence performance.

composeWords = 3, decomposeWords = 3

Van der weyden, Vander weyden, Vanderweyden, Va nd erweyden, Van derweyden, …

Vanderweyden, Vander weyden, Van derweyden
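
How the compose variants in the example arise can be sketched as follows, assuming the parameter value is the maximum number of adjacent words that may be glued together. The function name is hypothetical and the sketch does not reproduce HANA's internal handling.

```python
def compose_variants(tokens: list[str], max_compose: int) -> list[str]:
    """Return every phrase obtainable by gluing up to max_compose adjacent tokens."""
    if not tokens:
        return [""]
    variants = []
    # Take 1..max_compose leading tokens as one compound, then recurse on the rest.
    for size in range(1, min(max_compose, len(tokens)) + 1):
        head = "".join(tokens[:size])
        for tail in compose_variants(tokens[size:], max_compose):
            variants.append(f"{head} {tail}".strip())
    return variants


print(compose_variants(["van", "der", "weyden"], 3))
```

The number of variants grows quickly with the number of words and the parameter value, which is one reason these options cost performance.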

ABAP

Could it be this easy?

SELECT * FROM estri INTO TABLE DATA(lt_estri) WHERE contains (ident , 'ethyleen' , FUZZY (0.8) ).

Not available in ABAP Open SQL:
•  http://scn.sap.com/thread/3757646
•  http://scn.sap.com/community/abap/hana/blog/2012/12/28/sap-teched-2012-abap-for-sap-hana-how-to-exploit-the-power-of-sap-hana

ABAP – ADBC-interface

•  Prepare the statement in HANA Studio
•  Our own framework wraps the ADBC interface and reduces programming effort

DATA(lr_data) = lr_object->execute( ).
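
Because the CONTAINS predicate has to be passed to HANA as native SQL text, the statement must be assembled as a string before it is executed. The sketch below shows that assembly step in Python rather than ABAP. The function name and its checks are hypothetical; the wrapper framework mentioned on the slide is not shown in the deck.

```python
def build_fuzzy_query(table: str, column: str, term: str,
                      min_score: float = 0.6,
                      options: str = "textsearch=compare") -> str:
    """Assemble a CONTAINS ... FUZZY statement as text for native execution."""
    if not 0.0 <= min_score <= 1.0:
        raise ValueError("min_score must be between 0.0 and 1.0")
    for name in (table, column):
        # Allow plain identifiers only, so no SQL can be smuggled in via names.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unexpected identifier: {name!r}")
    # Double single quotes so user input cannot terminate the string literal.
    safe_term = term.replace("'", "''")
    safe_options = options.replace("'", "''")
    return (
        f"select distinct score() as score, {column} "
        f"from {table} "
        f"where contains( {column}, '{safe_term}', fuzzy({min_score}, '{safe_options}') ) "
        f"order by score desc, {column}"
    )


print(build_fuzzy_query("z_estri", "ident", "ethyleen"))
```

Escaping the search term matters here because it comes straight from user input in the UI5 app.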

SAP Documentation

SAP HANA Search Developer Guide (24/06/2015)
http://help.sap.com/hana/SAP_HANA_Search_Developer_Guide_en.pdf

SAP HANA Developer Guide (28/05/2014), Chapter 10
http://hcp.sap.com/content/dam/website/saphana/en_us/Technology%20Documents/SAP_HANA_Developer_Guide_en.pdf

SAP HANA Fuzzy Search Reference (Help Portal)
http://help.sap.com/saphelp_hanaplatform/helpdata/en/27/b6f00d4d4744d1b3dcfdea68e0eb0a/frameset.htm

Thank you!
