Value extraction from BBVA credit card transactions. Iván de Prado at Big Data Spain 2012

Value extraction from BBVA credit card transactions. Iván de Prado Alonso – CEO of Datasalt. www.datasalt.es @ivanprado @datasalt www.bigdataspain.org November 16th, 2012. ETSI Telecomunicación, Madrid, Spain. #BDSpain

Upload: big-data-spain

Post on 28-Nov-2014


DESCRIPTION

Session presented at Big Data Spain 2012 Conference 16th Nov 2012 ETSI Telecomunicacion UPM Madrid www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/value-extraction-from-bbva-credit-card-transactions/ivan-de-prado

TRANSCRIPT

Page 1: Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Data Spain 2012

Value extraction from BBVA credit card transactions

Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt

www.bigdataspain.org November 16th, 2012 ETSI Telecomunicación Madrid Spain #BDSpain

Page 2:
Page 3:

BIG “MAC” DATA

Page 4:

104,000 employees
47 million customers

Page 5:

The idea

Extract value from anonymized credit card transactions data & share it

Always:
✓ Impersonal
✓ Aggregated
✓ Dissociated
✓ Irreversible

Page 6:

Helping

Consumers: informed decision
✓ Shop recommendations (by location and by category)
✓ Best time to buy
✓ Activity & fidelity of shop’s customers

Sellers: learning clients’ patterns
✓ Activity & fidelity of shop’s customers
✓ Sex & Age & Location
✓ Buying patterns

Page 7:

Shop stats

For different periods
✓ All, year, quarter, month, week, day

… and much more

Page 8:

The applications

Customers

Internal use

Sellers

Page 9:

The challenges

Company silos
The amount of data
The costs
Security
Development flexibility/agility
Human failures

Page 10:

The platform

S3: Data storage
Elastic MapReduce: Data processing
EC2: Data serving

Page 11:

The architecture

Page 12:

Hadoop

Distributed Filesystem
✓ Files as big as you want
✓ Horizontal scalability
✓ Failover

Distributed Computing
✓ MapReduce
✓ Batch oriented
  • Input files are processed and converted into output files
✓ Horizontal scalability

Page 13:

Easier Hadoop Java API
✓ But keeping similar efficiency

Common design patterns covered
✓ Compound records
✓ Secondary sorting
✓ Joins

Other improvements
✓ Instance based configuration
✓ First class multiple input/output

Tuple MapReduce implementation for Hadoop

Page 14:

Tuple MapReduce

Our evolution of Google’s MapReduce

Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: Tuple MapReduce: Beyond classic MapReduce. In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining. Brussels, Belgium | December 10-13, 2012

Page 15:

Tuple MapReduce

Sales difference between the top-selling offices in each location

Page 16:

Tuple MapReduce

Main constraint
✓ Group by clause must be a subset of sort by clause

Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
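A minimal Python sketch of the group-by/sort-by idea, not Pangool's Java API: it simulates one Tuple MapReduce step in memory and applies it to the sales example from page 15. The function names, field names and sample figures are hypothetical. The sketch assumes one record per office holding its total sales, reads "sales difference" as the gap between the two top-selling offices of a location, and enforces the constraint in a stricter prefix form (the group-by fields must lead the sort order).

```python
from itertools import groupby
from operator import itemgetter


def tuple_mapreduce(records, group_by, sort_by, reducer):
    """Simulate one Tuple MapReduce step over a list of dict records.

    sort_by is a list of (field, descending) pairs.
    """
    # Assumption of this sketch: the group-by fields are a prefix of sort-by.
    sort_fields = [field for field, _ in sort_by]
    if sort_fields[:len(group_by)] != list(group_by):
        raise ValueError("group_by fields must lead the sort_by fields")

    # Stable sorts applied from the last key to the first give a multi-key sort.
    ordered = list(records)
    for field, descending in reversed(sort_by):
        ordered.sort(key=itemgetter(field), reverse=descending)

    # Each group reaches the reducer already sorted by the remaining fields.
    results = []
    for key, rows in groupby(ordered, key=lambda r: tuple(r[f] for f in group_by)):
        results.append(reducer(key, list(rows)))
    return results


def top_two_gap(key, rows):
    # rows are sorted by sales, highest first, within one location
    best = rows[0]["sales"]
    second = rows[1]["sales"] if len(rows) > 1 else 0
    return key[0], best - second


sales = [
    {"location": "Madrid", "office": "A", "sales": 120},
    {"location": "Madrid", "office": "B", "sales": 90},
    {"location": "Bilbao", "office": "C", "sales": 70},
    {"location": "Madrid", "office": "D", "sales": 200},
    {"location": "Bilbao", "office": "E", "sales": 40},
]

gaps = tuple_mapreduce(
    sales,
    group_by=["location"],
    sort_by=[("location", False), ("sales", True)],
    reducer=top_two_gap,
)
print(gaps)
```

Because the reducer receives each group already ordered, it only has to look at the first two rows instead of sorting the group itself.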

Page 17:

Efficiency

http://pangool.net/benchmark.html

Similar efficiency to Hadoop

Page 18:

Voldemort

Distributed key/value store

Page 19:

Voldemort & Hadoop

Benefits
✓ Scalability & failover
✓ Updating the database does not affect serving queries
✓ All data is replaced at each execution
  • Providing agility/flexibility
    § Big development changes are not a pain
  • Easier recovery from human errors
    § Fix code and run again
  • Easy to set up new clusters with different topologies

Page 20:

Basic statistics

Count   Average   Min   Max   Stdev

Easy to implement with Pangool/Hadoop
✓ One job, grouping by the dimension over which you want to calculate the statistics.

Computing several time periods in the same job
✓ Use the mapper to replicate each datum for each period
✓ Add a period identifier field to the tuple and include it in the group by clause
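The period-replication trick can be sketched in plain Python, with a dictionary standing in for the shuffle phase. This is an illustration under assumptions, not the project's Pangool code: the period identifier format, the function names and the sample transactions are all hypothetical, and the standard deviation is the population one.

```python
import math
from collections import defaultdict
from datetime import date


def periods_for(day):
    """One period identifier per granularity (hypothetical naming scheme)."""
    iso_year, iso_week, _ = day.isocalendar()
    return [
        "all",
        f"year:{day.year}",
        f"quarter:{day.year}-Q{(day.month - 1) // 3 + 1}",
        f"month:{day.year}-{day.month:02d}",
        f"week:{iso_year}-W{iso_week:02d}",
        f"day:{day.isoformat()}",
    ]


def map_transaction(shop, day, amount):
    # Replicate the datum once per period; the period id joins the group key.
    for period in periods_for(day):
        yield (shop, period), amount


def reduce_stats(amounts):
    n = len(amounts)
    mean = sum(amounts) / n
    variance = sum((a - mean) ** 2 for a in amounts) / n  # population variance
    return {
        "count": n,
        "avg": mean,
        "min": min(amounts),
        "max": max(amounts),
        "stdev": math.sqrt(variance),
    }


def run_job(transactions):
    groups = defaultdict(list)  # stands in for the shuffle: group by (shop, period)
    for shop, day, amount in transactions:
        for key, value in map_transaction(shop, day, amount):
            groups[key].append(value)
    return {key: reduce_stats(values) for key, values in groups.items()}


transactions = [
    ("shop1", date(2012, 11, 15), 10.0),
    ("shop1", date(2012, 11, 16), 30.0),
    ("shop1", date(2012, 3, 2), 20.0),
]
stats = run_job(transactions)
print(stats[("shop1", "month:2012-11")])
```

Each transaction is emitted six times, so the shuffle volume grows with the number of period granularities; that is the price of getting every period out of a single job.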

Page 21:

Distinct count

Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct count on
✓ Detecting changes on that field

Example

Shop     Card
Shop 1   1234   Change +1
Shop 1   1234
Shop 1   1234
Shop 1   5678   Change +1
Shop 1   5678

2 distinct buyers for shop 1

✓ Group by shop, sort by shop and card
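The change-detection idea from the example above can be sketched as follows. This is plain Python, not Pangool: `sorted` stands in for the framework's secondary sort, and the function name is hypothetical. The point is that the reducer keeps only the previous card in memory, never a set of all cards.

```python
from itertools import groupby


def distinct_buyers(transactions):
    """Count distinct cards per shop from (shop, card) pairs."""
    # Sorting by (shop, card) stands in for "group by shop, sort by shop and card".
    ordered = sorted(transactions)
    result = {}
    for shop, rows in groupby(ordered, key=lambda row: row[0]):
        distinct = 0
        previous_card = object()  # sentinel that never equals a real card
        for _, card in rows:
            if card != previous_card:  # a change on the sorted field: +1
                distinct += 1
                previous_card = card
        result[shop] = distinct
    return result


transactions = [
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
    ("Shop 1", "1234"),
    ("Shop 1", "1234"),
    ("Shop 1", "5678"),
]
print(distinct_buyers(transactions))
```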

Page 22:

Histograms

Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges
✓ Second pass to count the number of occurrences in each bin

Adaptive histogram
✓ One pass
✓ Fixed number of bins
✓ Bins adapt
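The slide does not say how the bins adapt. One common streaming heuristic, used here purely as an assumption, keeps at most a fixed number of (centroid, count) bins and merges the two closest bins whenever a new value would exceed that limit. The class and method names below are hypothetical.

```python
import bisect


class AdaptiveHistogram:
    """One-pass histogram with a fixed maximum number of bins."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # [centroid, count] pairs, kept sorted by centroid

    def add(self, value):
        centroids = [b[0] for b in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1  # value matches an existing centroid exactly
            return
        self.bins.insert(i, [value, 1])
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the adjacent pair with the smallest gap between centroids.
        gaps = [self.bins[j + 1][0] - self.bins[j][0]
                for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (c1, n1), (c2, n2) = self.bins[j], self.bins[j + 1]
        # Replace the pair with one bin at the count-weighted mean.
        merged = [(c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2]
        self.bins[j:j + 2] = [merged]


hist = AdaptiveHistogram(max_bins=3)
for amount in [1, 2, 10, 11, 50]:
    hist.add(amount)
print(hist.bins)
```

No minimum or maximum has to be known in advance, which is what removes the first pass of the classic algorithm.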

Page 23:

Optimal histogram

Calculate the histogram that best represents the original one using a limited number of flexible-width bins
✓ Reduce storage needs
✓ More representative than fixed-width ones -> better visualization

Page 24:

Optimal histogram

Exact algorithm
Petri Kontkanen, Petri Myllymäki: MDL Histogram Density Estimation
http://eprints.pascal-network.org/archive/00002983/

Too slow for production use

Page 25:

Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing
✓ A solution is just a way of grouping existing bins
✓ From a solution, you can move to some close solutions
✓ Some are better: they reduce the representation error

Algorithm
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until no improvement
      1. Move to the next possible movement that improves the solution
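The steps above can be sketched in Python under several assumptions the slides do not state: a solution is a tuple of cut points over equal-width fine bins, a "close" solution shifts one cut point by one bin, and the representation error is the squared deviation of each fine bin from the mean height of its group. All names and the sample counts are hypothetical.

```python
import random


def error(counts, cuts):
    """Squared error when each group of fine bins is drawn at its mean height."""
    total = 0.0
    edges = [0] + list(cuts) + [len(counts)]
    for start, end in zip(edges, edges[1:]):
        group = counts[start:end]
        mean = sum(group) / len(group)
        total += sum((c - mean) ** 2 for c in group)
    return total


def neighbours(cuts, n):
    """Solutions reached by shifting one cut point one fine bin left or right."""
    for i, cut in enumerate(cuts):
        lower = cuts[i - 1] if i > 0 else 0
        upper = cuts[i + 1] if i + 1 < len(cuts) else n
        for moved in (cut - 1, cut + 1):
            if lower < moved < upper:  # keep every group non-empty
                yield cuts[:i] + (moved,) + cuts[i + 1:]


def hill_climb(counts, k, restarts, rng):
    """Group the fine bins in counts into k wide bins; return (cuts, error)."""
    n = len(counts)
    best, best_err = None, float("inf")
    for _ in range(restarts):                      # 1. iterate N times
        current = tuple(sorted(rng.sample(range(1, n), k - 1)))  # random solution
        current_err = error(counts, current)
        improved = True
        while improved:                            # iterate until no improvement
            improved = False
            for candidate in neighbours(current, n):
                candidate_err = error(counts, candidate)
                if candidate_err < current_err:    # take the first better move
                    current, current_err = candidate, candidate_err
                    improved = True
                    break
        if current_err < best_err:                 # keep the best solution
            best, best_err = current, current_err
    return best, best_err


counts = [5, 5, 5, 20, 20, 20, 1, 1, 1]
cuts, err = hill_climb(counts, k=3, restarts=20, rng=random.Random(0))
print(cuts, err)
```

A single climb can stop at a local optimum, for example cuts (6, 8) on the sample data, which is why the restarts matter.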

Page 26:

Optimal histogram

Alternative: approximate algorithm

Random-restart hill climbing
✓ One order of magnitude faster
✓ 99% accuracy

Page 27:

Everything in one job

Basic statistics -> 1 job
Distinct count statistics -> 1 job
One pass histograms -> 1 job
Several periods & shops -> 1 job

We can put it all together so that computing all statistics for all shops fits into exactly one job

Page 28:

Shop recommendations

Based on co-occurrences
✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists
✓ Only one co-occurrence is counted even if a buyer bought several times in A and B
✓ The top co-occurrences for each shop are the recommendations

Improvements
✓ Most popular shops are filtered out because almost everybody buys in them.
✓ Recommendations by category, by location and by both
✓ Different calculation periods
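A small in-memory sketch of co-occurrence recommendations, written in Python for illustration rather than as the Pangool jobs used in the project. The `max_popularity` threshold (a cap on the number of distinct buyers a recommended shop may have) is a hypothetical way to express the "most popular shops are filtered out" rule; the slides do not say how popularity is measured.

```python
from collections import Counter, defaultdict
from itertools import permutations


def recommend(purchases, top_n=3, max_popularity=None):
    """purchases: iterable of (buyer, shop). Returns {shop: [recommended shops]}."""
    # A set per buyer means repeated purchases count as one co-occurrence.
    shops_by_buyer = defaultdict(set)
    for buyer, shop in purchases:
        shops_by_buyer[buyer].add(shop)

    # Popularity = number of distinct buyers per shop.
    popularity = Counter(
        shop for shops in shops_by_buyer.values() for shop in shops
    )

    # Count ordered pairs (A, B): "people who bought in A also bought in B".
    cooccurrences = defaultdict(Counter)
    for shops in shops_by_buyer.values():
        for a, b in permutations(sorted(shops), 2):
            cooccurrences[a][b] += 1

    recommendations = {}
    for shop, counter in cooccurrences.items():
        candidates = [
            (other, n) for other, n in counter.items()
            if max_popularity is None or popularity[other] <= max_popularity
        ]
        # Highest co-occurrence first; shop name breaks ties deterministically.
        candidates.sort(key=lambda item: (-item[1], item[0]))
        recommendations[shop] = [other for other, _ in candidates[:top_n]]
    return recommendations


purchases = [
    ("u1", "A"), ("u1", "B"), ("u1", "A"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "A"), ("u3", "C"),
]
print(recommend(purchases))
```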

Page 29:

Shop recommendations

Implemented in Pangool
✓ Using its counting and joining capabilities
✓ Several jobs

Challenges
✓ If somebody bought in many shops, the list of co-occurrences can explode:
  • Co-occurrences = N * (N – 1), where N = # of distinct shops where the person bought
✓ Alleviated by limiting the total number of distinct shops to consider
✓ Only uses the top M shops where the client bought the most

Future
✓ Time-aware co-occurrences: the client bought in A and in B within a short period of time.
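The N * (N – 1) growth and the top-M cap can be made concrete with a short Python sketch. The function names and the tie-breaking rule (alphabetical by shop name) are assumptions made for the example.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_pairs(n_shops):
    """Ordered pairs produced by one buyer with n_shops distinct shops."""
    return n_shops * (n_shops - 1)


def top_shops(buyer_purchases, m):
    """Keep only the m shops where this buyer bought the most."""
    counts = Counter(buyer_purchases)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [shop for shop, _ in ranked[:m]]


def capped_pairs(buyer_purchases, m):
    """Co-occurrence pairs for one buyer after applying the top-m cap."""
    return list(permutations(top_shops(buyer_purchases, m), 2))


buyer_purchases = ["A", "A", "A", "B", "B", "C", "D"]
print(cooccurrence_pairs(1000))          # pairs for a buyer with 1000 shops
print(capped_pairs(buyer_purchases, 2))  # pairs once capped to the top 2 shops
```

With the cap, one buyer contributes at most M * (M – 1) pairs regardless of how many shops they visited.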

Page 30:

Some numbers

Estimated resources needed with 1 year of data

270 GB of stats to serve

24 large instances ~ 11 hours of execution

$3500/month
✓ Optimizations still possible
✓ Cost without the use of reserved instances
✓ Probably cheaper with an in-house Hadoop cluster

Page 31:

Conclusion

It was possible to develop a Big Data solution for a bank
✓ With low use of resources
✓ Quickly
✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases

The solution is
✓ Scalable
✓ Flexible/agile. Improvements are easy to implement
✓ Prepared to withstand human failures
✓ At a reasonable cost

Main advantage: always recomputing everything

Page 32:

Future: Splout

Key/value datastores have limitations
✓ Only accept querying by the key
✓ Aggregations are not possible
✓ In other words, we are forced to pre-compute everything
  ✓ Not always possible -> data explode
  ✓ For this particular case, time ranges are fixed

Splout: like Voldemort but SQL!
✓ The idea: to replace Voldemort with Splout SQL
✓ Much richer queries: real-time aggregations, flexible time ranges
✓ It would allow us to create some kind of Google Analytics for the statistics discussed in this presentation
✓ Open Sourced!!!

https://github.com/datasalt/splout-db
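The difference between the two serving models can be illustrated with Python's built-in `sqlite3` module. This is only a stand-in: it does not use Voldemort's or Splout SQL's actual APIs, and the table layout, key format and figures are hypothetical.

```python
import sqlite3

# Key/value serving: only answers that were pre-computed can be looked up,
# and only for the fixed periods chosen at computation time.
precomputed = {("shop1", "month:2012-11"): 40.0}
print(precomputed.get(("shop1", "month:2012-11")))

# SQL serving: the aggregation runs at query time over any date range.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("shop1", "2012-11-10", 10.0),
        ("shop1", "2012-11-16", 30.0),
        ("shop1", "2012-12-02", 5.0),
        ("shop2", "2012-11-16", 7.0),
    ],
)


def total_sales(conn, shop, start_day, end_day):
    """Sum the sales of one shop between two ISO dates, inclusive."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales "
        "WHERE shop = ? AND day BETWEEN ? AND ?",
        (shop, start_day, end_day),
    ).fetchone()
    return row[0]


# A range that crosses a month boundary, which the fixed keys cannot answer.
print(total_sales(conn, "shop1", "2012-11-15", "2012-12-05"))
```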

Page 33:

Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt

Questions?