data vault automation at - bi-podium...data vault automation at de bijenkorf ... dynamodb!!...

21
Data Vault Automation at de Bijenkorf PRESENTED BY ROB WINTERS ANDREI SCORUS

Upload: others

Post on 22-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Data Vault Automation at

de Bijenkorf PRESENTED BY

ROB WINTERS ANDREI SCORUS

Page 2: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Presentation agenda ◦ Project  objec*ves  

◦ Architectural  overview  

◦ The  data  warehouse  data  model  

◦ Automa*on  in  the  data  warehouse  

◦ Successes  and  failures  

◦ Conclusions  

Page 3: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

About the presenters   Rob  Winters  

  Head  of  Data  Technology,  the  Bijenkorf  

  Project  role:  ◦ Project  Lead  ◦  Systems  architect  and  administrator  

◦  Data  modeler  

◦  Developer  (ETL,  predic;ve  models,  reports)  

◦  Stakeholder  manager  

◦ Joined  project  September  2014  

Andrei  Scorus  

  BI  Consultant,  Incentro  

  Project  role:  ◦ Main  ETL  Developer  ◦  ETL  Developer  ◦  Modeling  support  

◦  Source  system  expert  

◦ Joined  project  November  2014  

Page 4: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Project objectives

◦   Informa*on  requirements  ◦  Have  one  place  as  the  source  for  all  

reports  

◦  Security  and  privacy  ◦  Informa*on  management  

◦  Integrate  with  produc*on  ◦ Non-­‐func*onal  requirements  ◦  System  quality  

◦  Extensibility  ◦  Scalability  ◦  Maintainability  

◦  Security  ◦  Flexibility  ◦  Low  Cost  

Technical  Requirements  

•  One  environment  to  quickly  generate  customer  insights  

•  Then  feed  those  insights  back  to  produc;on  

•  Then  measure  the  impact  of  those  changes  in  near  real  ;me  

Page 5: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Source system landscape

Source  Type   Number  of  Sources  

Examples   Load  Frequency  

Data  Structure  

Oracle  DB   2   Virgo  ERP   2x/hour   Par;al  3NF  

MySQL   3   Product  DB,  Web  Orders,  DWH  

10x/hour   3NF  (Web  Orders),  Improperly  normalized  

Event  bus   1   Web/email  events  

1x/minute   Tab  delimited  with  JSON  fields  

Webhook   1   Transac;onal  Emails  

1x/minute   JSON  

REST  APIs   5+   GA,  DotMailer   1x/hour-­‐1x/day   JSON  

SOAP  APIs   5+   AdWords,  Pricing   1x/day   XML  

Page 6: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Architectural overview   Tools  

  AWS  ◦  S3  ◦  Kinesis  ◦  Elas;cache  ◦  Elas;c  Beanstalk  ◦  EC2  ◦  DynamoDB  

  Open  Source  ◦  Snowplow  Event  Tracker  ◦  Rundeck  Scheduler  ◦  Jenkins  Con;nuous  

Integra;on  ◦  Pentaho  PDI  

  Other  ◦  HP  Ver;ca  ◦  Tableau  ◦  Github  ◦  RStudio  Server  

Page 7: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

DWH internal architecture

•  Tradi;onal  three  ;er  DWH  •  ODS  generated  

automa;cally  from  staging  •  Ops  mart  reflects  data  in  

original  source  form  •  Helps  offload  queries  

from  source  systems  •  Business  marts  

materialized  exclusively  from  vault  

Page 8: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Bijenkorf Data Vault overview   Data  volumes  

• ~1  TB  base  volume  

• 10-­‐12  GB  daily  

• ~250  source  tables  

  Aligned  to  Data  Vault  2.0  

• Hash  keys  

• Hashes  used  for  CDC  

• Parallel  loading  

• Maximum  u;liza;on  of  available  resources  

• Data  unchanged  in  to  the  vault  

   Some  sta*s*cs  

  18  hubs  •  34  loading  scripts  

  27  links  •  43  loading  scripts  

  39  satellites  •  43  loading  scripts  

  13  reference  tables  •  1  script  per  table  

  Model  contains  

• Sales  transac;ons  

• Customer  and  corporate  loca;ons  

• Customers  

• Products  

• Payment  methods  

• E-­‐mail  

• Phone  

• Product  grouping  

• Campaigns  

• deBijenkorf  card  

• Social  media  

   Excluded  from  the  vault  ◦  Event  streams  

◦  Server  logs  ◦  Unstructured  data  

Page 9: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Deep dive: Transactions in DV • Transac;ons    

Page 10: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Deep dive: Customers in DV • Same  as  link  on  customer  

Page 11: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Challenges encountered during data modeling Challenge   Issue  Details   Resolu*on  

Source  issues   •  Source  systems  and  original  data  unavailable  for  most  informa;on  

•  Data  ohen  transformed  2-­‐4  ;mes  before  access  was  available  

•  Business  keys  (ex.  SKU)  typically  replaced  with  sequences  

•  Business  keys  rebuilt  in  staging  prior  to  vault  loading  

Modeling  returns  

•  Retail  returns  can  appear  in  ERP  in  1-­‐3  ways  across  mul;ple  tables  with  inconsistent  keys  

•  Online  returns  appear  as  a  state  change  on  original  transac;on  and  may/may  not  appear  in  ERP  

•  Original  model  showed  sale  state  on  line  item  satellite  

•  Revised  model  recorded  “nega;ve  sale”  transac;ons  and  used  a  new  link  to  connect  to  original  sale  when  possible  

Fragmented  knowledge  

•  Informa;on  about  the  systems  was  being  held  by  mul;ple  people  

•  Documenta;on  was  out-­‐of-­‐date  

•  Talking  to  as  many  people  as  possible  and  tes;ng  hypotheses  on  the  data  

Page 12: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Targeted benefits of DWH automation Objec*ve   Achievements  

Speed  of  development   •  Integra;on  of  new  sources  or  data  from  exis;ng  sources  takes  1-­‐2  steps  

•  Adding  a  new  vault  dependency  takes  one  step  

Simplicity   •  Five  jobs  handle  all  ETL  processes  across  DWH  

Traceability   •  Every  record/source  file  is  traced  in  the  database  and  every  row  automa;cally  iden;fied  by  source  file  in  ODS  

Code  simplifica*on   •  Replaced  most  common  key  defini;ons  with  dynamic  variable  replacement  

File  management   •  Every  source  file  automa;cally  archived  to  Amazon  S3  in  appropriate  loca;ons  sorted  by  source,  table,  and  date  

•  En;re  source  systems,  periods,  etc  can  be  replayed  in  minutes  

Page 13: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Source loading automation o   Design  of  loader  focused  on  process  abstrac;on,  traceability,  and  minimiza;on  of  “moving  parts”  

o   Final  process  consisted  of  two  base  jobs  working  in  tandem:  one  for  genera;ng  incremental  extracts  from  source  systems,  one  for  loading  flat  files  from  all  sources  to  staging  tables  o Replica;on  was  desired  but  rejected  due  to  limited  access  to  source  systems  

Source  tables  duplicated  in  staging  with  addi;on  of  loadTs  and  sourceFile  columns  

Metadata  for  source  file  added  

Loader  automa;cally  generates  ODS,  begins  tracking  source  files  for  duplica;on  and  data  quality  

Query  generator  

automa;cally  executes  full  duplica;on  on  first  execu;on  

and  incrementals  aherward  

CREATE  TABLE  stg_oms.customer  (  customerId  int  ,  customerName  varchar(500)  ,  customerAddress  varchar(5000)  ,  loadTs  ;mestamp  NOT  NULL  ,  sourceFile  varchar(255)  NOT  NULL  )  ORDER  BY  customerId  PARTITION  BY  date(loadTs)  ;  INSERT  INTO  meta.source_to_stg_mapping  (targetSchema,  targetTable,  sourceSystem,  fileNamePapern,  delimiter,  nullField)  VALUES  ('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')  ;  

Example:  Add  addi*onal  table  from  exis*ng  source  Workflow  of  source  integra*on  

Page 14: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Vault loading automation

• New  sources  automa;cally  added  

• Last  change  epoch  based  on  load  stamps,  advanced  each  ;me  all  dependencies  execute  successfully  

All  Staging  Tables  Checked  

for  Changes  

• Dependencies  declared  at  ;me  of  job  crea;on  

• Load  priori;za;on  possible  but  not  u;lized  

List  of  Dependent  Vault  Loads  

Iden;fied  

•  Jobs  parallelized  across  tables  but  serialized  per  job  

• Dynamic  job  queueing  ensures  appropriate  execu;on  order  

Loads  Planned  in  Hub,  Link,  Sat  Order  

• Variables  automa;cally  iden;fied  and  replaced  

• Each  load  records  performance  sta;s;cs  and  error  messages  

Loads  Executed  

o   Loader  is  fully  metadata  driven  with  focus  on  horizontal  scalability  and  management  simplicity  

o   To  support  speed  of  development  and  performance,  variable-­‐driven  SQL  templates  used  throughout  

Page 15: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Design goals for mart loading automation

Requirement   Solu;on   Benefit  

Simple,  standardized  

models  

Metadata-­‐driven  Pentaho  PDI  

Easy  development  

using  parameters  and  variables  

Easily  Extensible  

Plugin  framework  

Rapid  integra;on  of  new  

func;onality  

Rapid  new  job  development  

Recycle  standardized  jobs  

and  transforma;ons  

Limited  moving  parts,  easy  modifica;on  

Low  administra;on  

overhead  

Leverage  built  in  logging  and  tracking  

Easily  integrated  mart  loading  repor;ng  with  

other  ETL  reports  

Page 16: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Data Information mart automation flow Retrieve  

commands  

• Each  dimension  and  fact  is  processed  independently  

Get  dependencies  

• Based  on  defined  transforma;on,  get  all  related  vault  tables:  links,  satellites  or  hubs  

Retrieve  changed  data  

• From  the  related  tables,  build  a  list  of  unique  keys  that  have  changed  since  the  last  update  of  the  fact  or  dimension  • Store  the  data  in  the  database  un;l  further  processing  

Execute  transforma*ons  

• Mul;ple  Pentaho  transforma;ons  can  be  processed  per  command  using  the  data  captured  in  previous  steps  

Maintentance  

• Logging  happens  throughout  the  whole  process  • Cleanup  aher  all  commands  have  been  processed  

Page 17: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Primary uses of Bijenkorf DWH Cu

stom

er  Analysis   •  Provided  first  

unified  data  model  of  customer  ac;vity  

•  80%  reduc;on  in  unique  customer  keys  

• Allowed  for  segmenta;on  of  customers  based  on  combina;on  of  in-­‐store  and  online  ac;vity  

Person

aliza;

on  

• DV  drives  recommenda;on  engine  and  customer  recommenda;ons  (updated  nightly)  

• Data  pipeline  supports  near  real  ;me  upda;ng  of  customer  recommenda;ons  based  on  web  ac;vity   Bu

sine

ss  Intelligence   • DV-­‐based  marts  

replace  joining  dozens  of  tables  across  mul;ple  sources  with  single  facts/  dimensions  

•  IT-­‐driven  repor;ng  being  replaced  with  self-­‐service  BI  

Page 18: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Biggest drivers of success AWS  Infrastructure  

Cost:  En;re  infrastructure  for  less  than  one  server  in  the  data  center  

Toolset:  Most  services  available  off  the  shelf,  minimizing  administra;on  

Freedom:  No  dependency  on  IT  for  development  support  

Scalability:  Systems  automa;cally  scaled  to  match  DWH  demands  

Automa;on  

Speed:  Enormous  ;me  savings  aher  ini;al  investment  

Simplicity:  Able  to  run  and  monitor  40k+  queries  per  day  with  minimal  effort  

Auditability:  Enforced  tracking  and  archiving  without  developer  involvement  

PDI  framework  

Ease  of  use:  Adding  new  commands  takes  at  most  45  minutes  

Agile:  Building  the  framework  took  1  day  

Low  profile:  Average  memory  usage  of  250MB  

Page 19: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Biggest mistakes along the way

•  Ini;al  integra;on  design  was  based  on  provided  documenta;on/models  which  was  rarely  accurate  

•  Current  users  of  sources  should  have  been  engaged  earlier  to  explain  undocumented  caveats  

Reliance  on  documenta;on  and  requirements  over  expert  users  

•  Variables  were  u;lized  late  in  development,  slowing  progress  significantly  and  crea;ng  consistency  issues  

•  Good  ini;al  design  of  templates  will  significantly  reduce  development  ;me  in  mid/long  run  

Late  u;liza;on  of  templates  and  variables  

• We  apempted  to  design  and  populate  the  en;re  data  vault  prior  to  focusing  on  customer  deliverables  like  reports  (in  addi;on  to  other  projects)  

• We  have  shihed  focus  to  con;nuous  release  of  new  informa;on  rather  than  wai;ng  for  completeness  

Aggressive  overextension  of  resources  

Page 20: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Primary takeaways

◦ Sources  are  like  cars:  the  older  they  are,  the  more  idiosyncrasies.  Be  cau;ous  with  design  automa;on!  

◦ Automa;on  can  enormously  simplify/accelerate  data  warehousing.  Don’t  be  afraid  to  roll  your  own  

◦ Balance  stateful  versus  stateless  and  monolithic  versus  fragmented  architecture  design  

◦ Cloud  based  architecture  based  on  column  store  DBs  is  extremely  scalable,  cheap,  and  highly  performant  

◦ A  successful  vault  can  create  a  new  problem:  gewng  IT  to  think  about  business  processes  rather  than  system  keys!  

Page 21: Data Vault Automation at - BI-Podium...Data Vault Automation at de Bijenkorf ... DynamoDB!! OpenSource Snowplow!EventTracker! RundeckScheduler! JenkinsConnuous Integraon! Pentaho!PDI!

Rob  Winters  [email protected]  

Andrei  Scorus  [email protected]