Value extraction from BBVA credit card transactions. Iván de Prado at Big Data Spain 2012
DESCRIPTION
Session presented at the Big Data Spain 2012 Conference, 16th Nov 2012, ETSI Telecomunicación, UPM, Madrid. www.bigdataspain.org More info: http://www.bigdataspain.org/es-2012/conference/value-extraction-from-bbva-credit-card-transactions/ivan-de-prado
TRANSCRIPT
Value extraction from BBVA credit card transactions
Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt
www.bigdataspain.org November 16th, 2012 ETSI Telecomunicación Madrid Spain #BDSpain
BIG “MAC” DATA
104,000 employees 47 million customers
The idea
Extract value from
anonymized credit card transaction data & share it
Always: ✓ Impersonal ✓ Aggregated ✓ Dissociated ✓ Irreversible
Helping
Consumers
Sellers
Informed decisions ✓ Shop recommendations (by location and by category) ✓ Best time to buy ✓ Activity & fidelity of a shop's customers
Learning client patterns ✓ Activity & fidelity of a shop's customers ✓ Sex & age & location ✓ Buying patterns
Shop stats for different periods ✓ All, year, quarter, month, week, day
… and much more
The applications
Customers
Internal use
Sellers
The challenges
Company silos
The amount of data
The costs
Security
Development flexibility/agility
Human failures
The platform
S3: data storage
Elastic MapReduce: data processing
EC2: data serving
The architecture
Hadoop
Distributed filesystem ✓ Files as big as you want ✓ Horizontal scalability ✓ Failover
Distributed computing ✓ MapReduce ✓ Batch oriented
• Input files are processed and converted into output files ✓ Horizontal scalability
Easier Hadoop Java API ✓ But keeping similar efficiency
Common design patterns covered ✓ Compound records ✓ Secondary sorting ✓ Joins
Other improvements ✓ Instance-based configuration ✓ First-class multiple inputs/outputs
Tuple MapReduce implementation for Hadoop
Tuple MapReduce
Pere Ferrera, Iván de Prado, Eric Palacios, Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo: "Tuple MapReduce: Beyond Classic MapReduce." In ICDM 2012: Proceedings of the IEEE International Conference on Data Mining, Brussels, Belgium, December 10-13, 2012.
Our evolution of Google's MapReduce
Tuple MapReduce: sales difference between the top-selling offices for each location
Tuple MapReduce
Main constraint
✓ The group-by clause must be a subset of the sort-by clause
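The sales example above can be simulated outside of Hadoop. Below is a minimal Python sketch (not Pangool's actual Java API; field names and sample data are invented) that mimics the tuple model and shows the constraint in action: the group-by field ("location") is a prefix of the sort-by clause ("location, sales").

```python
# Minimal sketch (not Pangool's Java API): simulating Tuple MapReduce for
# "sales difference between the two top-selling offices per location".
from itertools import groupby

sales = [  # tuples of (location, office, sales) -- illustrative data
    ("madrid", "office-1", 100.0),
    ("madrid", "office-2", 250.0),
    ("madrid", "office-3", 180.0),
    ("bilbao", "office-4", 90.0),
    ("bilbao", "office-5", 140.0),
]

# Sort by (location, -sales): the group-by field "location" is a subset
# (prefix) of the sort-by clause, which is the model's main constraint.
sales.sort(key=lambda t: (t[0], -t[2]))

# Reduce: each group arrives sorted by descending sales, so the difference
# between the two top-selling offices is just the first two rows.
for location, group in groupby(sales, key=lambda t: t[0]):
    rows = list(group)
    if len(rows) >= 2:
        print(location, rows[0][2] - rows[1][2])
```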
Indeed, Tuple MapReduce can be implemented on top of any MapReduce implementation
• Pangool -> Tuple MapReduce over Hadoop
Efficiency
http://pangool.net/benchmark.html
Similar efficiency to Hadoop
Voldemort
Distributed key/value store
Voldemort & Hadoop
Benefits ✓ Scalability & failover ✓ Updating the database does not affect serving queries ✓ All data is replaced at each execution (see the sketch below)
• Providing agility/flexibility: big development changes are not a pain
• Easier survival of human errors: fix the code and run again
• Easy to set up new clusters with different topologies
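The "replace all data at each execution" pattern can be shown generically. This sketch is plain Python, not Voldemort's own store-building tooling; paths and names are illustrative.

```python
# Generic sketch of rebuilding the whole store offline and swapping it in:
# readers keep serving the old version until the atomic rename below.
import os, tempfile, json

def publish_store(records, base_dir):
    """Write a complete new version of the store, then atomically point
    the 'current' symlink at it; queries are never affected mid-build."""
    new_dir = tempfile.mkdtemp(dir=base_dir, prefix="v-")
    with open(os.path.join(new_dir, "data.json"), "w") as f:
        json.dump(records, f)
    link = os.path.join(base_dir, "current")
    tmp_link = link + ".tmp"
    os.symlink(new_dir, tmp_link)
    os.replace(tmp_link, link)  # atomic swap: serving switches instantly
```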
Basic statistics
Count, average, min, max, stdev
Easy to implement with Pangool/Hadoop ✓ One job, grouping by the dimension over which you want to calculate the statistics
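A minimal sketch of that single-job pattern, with the shuffle phase simulated by a dictionary (field names and sample data are invented):

```python
# Sketch: count/avg/min/max/stdev per group in a single reduce pass.
import math
from collections import defaultdict

def reduce_stats(amounts):
    n = len(amounts)
    s = sum(amounts)
    s2 = sum(a * a for a in amounts)
    avg = s / n
    # population standard deviation from the running sums
    stdev = math.sqrt(max(s2 / n - avg * avg, 0.0))
    return {"count": n, "avg": avg, "min": min(amounts),
            "max": max(amounts), "stdev": stdev}

groups = defaultdict(list)  # stand-in for Hadoop's shuffle/group-by
for shop, amount in [("shop-1", 10.0), ("shop-1", 30.0), ("shop-2", 5.0)]:
    groups[shop].append(amount)
for shop, amounts in groups.items():
    print(shop, reduce_stats(amounts))
```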
Computing several time periods in the same job
✓ Use the mapper to replicate each datum for each period ✓ Add a period identifier field to the tuple and include it in the group-by clause
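A sketch of that mapper-side replication (period identifiers are illustrative):

```python
# Sketch: the mapper emits one copy of each transaction per period, adding
# a period identifier that becomes part of the group-by clause.
from datetime import date

def map_transaction(shop, day, amount):
    yield (shop, "all", amount)
    yield (shop, f"year-{day.year}", amount)
    yield (shop, f"month-{day.year}-{day.month:02d}", amount)
    yield (shop, f"day-{day.isoformat()}", amount)

for tup in map_transaction("shop-1", date(2012, 11, 16), 25.0):
    print(tup)  # grouping by (shop, period) computes all periods in one job
```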
Distinct count Possible to compute in a single job
✓ Using secondary sorting by the field you want to distinct-count on ✓ Detecting changes on that field
Example
Shop    Card   Change
Shop 1  1234   +1
Shop 1  1234
Shop 1  1234
Shop 1  5678   +1
Shop 1  5678
2 distinct buyers for shop 1
✓ Group by shop, sort by shop and card
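The same example as a runnable Python sketch, where a plain sort stands in for Hadoop's group-by with secondary sorting:

```python
# Sketch: distinct-buyer count in one pass over tuples grouped by shop and
# secondarily sorted by card. A change in the card value means +1 buyer.
from itertools import groupby

rows = [("shop-1", "1234"), ("shop-1", "1234"), ("shop-1", "1234"),
        ("shop-1", "5678"), ("shop-1", "5678")]
rows.sort()  # group by shop, sort by (shop, card)

for shop, group in groupby(rows, key=lambda r: r[0]):
    distinct, previous = 0, None
    for _, card in group:       # cards arrive sorted within the group
        if card != previous:    # change detected -> one more distinct buyer
            distinct += 1
            previous = card
    print(shop, distinct)       # shop-1 -> 2
```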
Histograms Typically a two-pass algorithm
✓ First pass to detect the minimum and the maximum and determine the bin ranges ✓ Second pass to count the number of occurrences in each bin
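A compact sketch of that two-pass approach:

```python
# Sketch of the classic two-pass fixed-width histogram described above.
def histogram_two_pass(values, num_bins):
    lo, hi = min(values), max(values)          # pass 1: find the range
    width = (hi - lo) / num_bins or 1.0        # avoid zero-width bins
    counts = [0] * num_bins
    for v in values:                           # pass 2: fill the bins
        i = min(int((v - lo) / width), num_bins - 1)
        counts[i] += 1
    return lo, width, counts

print(histogram_two_pass([1, 2, 2, 3, 9, 10], 3))
```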
Adaptive histogram
✓ One pass ✓ Fixed number of bins ✓ Bins adapt
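The talk does not detail the exact adaptive algorithm; a common one-pass approach keeps a bounded number of (centroid, count) bins and merges the two closest when the limit is exceeded. A sketch under that assumption:

```python
# Sketch of a one-pass adaptive histogram: keep at most `max_bins`
# (centroid, count) pairs and merge the two closest adjacent bins when the
# limit is exceeded. Illustrative only; the variant used may differ.
def add(bins, value, max_bins):
    bins.append((value, 1))
    bins.sort()
    if len(bins) > max_bins:
        # find the two adjacent centroids that are closest together
        i = min(range(len(bins) - 1),
                key=lambda j: bins[j + 1][0] - bins[j][0])
        (c1, n1), (c2, n2) = bins[i], bins[i + 1]
        merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)
        bins[i:i + 2] = [merged]   # bins adapt to the data seen so far
    return bins

bins = []
for v in [1, 2, 2, 3, 9, 10, 10, 11]:
    bins = add(bins, v, max_bins=3)
print(bins)
```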
Optimal histogram Calculate the best histogram that represents the original one using a limited number of flexible-width bins
✓ Reduces storage needs ✓ More representative than fixed-width ones -> better visualization
Optimal histogram
Exact algorithm: Petri Kontkanen, Petri Myllymäki, "MDL Histogram Density Estimation", http://eprints.pascal-network.org/archive/00002983/
Too slow for production use
Optimal histogram
Alternative: approximated algorithm
Random-restart hill climbing
Algorithm:
1. Iterate N times, keeping the best solution
   1. Generate a random solution
   2. Iterate until no improvement
      1. Move to the next better possible movement
✓ A solution is just a way of grouping the existing bins ✓ From a solution, you can move to some close solutions ✓ Some are better: they reduce the representation error
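A sketch of that search, where a solution is a set of cut points grouping the original fixed-width bins; the scoring function and move definition are illustrative assumptions, not the talk's exact formulation:

```python
# Sketch: random-restart hill climbing over groupings of the original bins.
# Moves shift one cut point by one position; the score is the squared error
# of representing each original bin count by its group's average.
import random

def error(counts, bounds):
    total, edges = 0.0, [0] + bounds + [len(counts)]
    for a, b in zip(edges, edges[1:]):
        avg = sum(counts[a:b]) / (b - a)
        total += sum((c - avg) ** 2 for c in counts[a:b])
    return total

def neighbours(bounds, n):
    for i, b in enumerate(bounds):
        lo = bounds[i - 1] + 1 if i else 1
        hi = bounds[i + 1] - 1 if i + 1 < len(bounds) else n - 1
        for nb in (b - 1, b + 1):
            if lo <= nb <= hi:
                yield bounds[:i] + [nb] + bounds[i + 1:]

def optimal_histogram(counts, k, restarts=10):
    best, best_err = None, float("inf")
    for _ in range(restarts):                        # iterate N times
        bounds = sorted(random.sample(range(1, len(counts)), k - 1))
        err = error(counts, bounds)                  # random solution
        improved = True
        while improved:                              # climb until stuck
            improved = False
            for cand in neighbours(bounds, len(counts)):
                if error(counts, cand) < err:
                    bounds, err, improved = cand, error(counts, cand), True
                    break                            # move to better solution
        if err < best_err:
            best, best_err = bounds, err             # keep the best solution
    return best, best_err

print(optimal_histogram([5, 6, 5, 20, 22, 1, 1, 2], k=3))
```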
Random-restart hill climbing results ✓ One order of magnitude faster ✓ 99% accuracy
Everything in one job
Basic statistics -> 1 job
Distinct count statistics -> 1 job
One-pass histograms -> 1 job
Several periods & shops -> 1 job
We can put it all together so that computing all statistics for all shops fits into exactly one job
Shop recommendations
Based on co-occurrences ✓ If somebody bought in shop A and in shop B, then a co-occurrence between A and B exists ✓ Only one co-occurrence is considered even if a buyer bought several times in A and B ✓ The top co-occurrences for each shop are the recommendations
Improvements ✓ The most popular shops are filtered out because almost everybody buys in them ✓ Recommendations by category, by location and by both ✓ Different calculation periods
Shop recommendations
Implemented in Pangool ✓ Using its counting and joining capabilities ✓ Several jobs
Challenges ✓ If somebody bought in many shops, the list of co-occurrences can explode: • Co-occurrences = N * (N - 1), where N = # of distinct shops where the person bought ✓ Alleviated by limiting the total number of distinct shops to consider ✓ Only the top M shops where the client bought the most are used (see the sketch below)
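A compact sketch of that logic (the limits M and TOP_RECS are invented, and the popularity filter mentioned earlier is omitted for brevity; the real system runs this as several Pangool jobs):

```python
# Sketch: for each card, take the top M shops by purchase count (capping
# the N*(N-1) explosion), emit each unordered shop pair once per buyer,
# then keep the top co-occurring shops per shop as recommendations.
from collections import Counter, defaultdict
from itertools import combinations

M, TOP_RECS = 50, 5                      # illustrative limits

def recommendations(purchases):
    """purchases: iterable of (card, shop) rows."""
    per_card = defaultdict(Counter)
    for card, shop in purchases:
        per_card[card][shop] += 1
    co = defaultdict(Counter)
    for card, shops in per_card.items():
        top = [s for s, _ in shops.most_common(M)]   # top M shops per card
        for a, b in combinations(sorted(top), 2):    # one co-occurrence/buyer
            co[a][b] += 1
            co[b][a] += 1
    return {shop: [s for s, _ in c.most_common(TOP_RECS)]
            for shop, c in co.items()}

rows = [("1234", "shop-A"), ("1234", "shop-B"), ("5678", "shop-A"),
        ("5678", "shop-B"), ("5678", "shop-C")]
print(recommendations(rows))
```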
Future ✓ Time-aware co-occurrences: the client bought in A and B within a close period of time
Some numbers Estimated resources needed with 1 year of data
270 GB of stats to serve
24 large instances, ~11 hours of execution
$3,500/month ✓ Optimizations still possible ✓ Cost without the use of reserved instances ✓ Probably cheaper with an in-house Hadoop cluster
Conclusion It was possible to develop a Big Data solution for a bank
✓ With low use of resources ✓ Quickly ✓ Thanks to the use of technologies like Hadoop, Amazon Web Services and NoSQL databases
The solution is ✓ Scalable ✓ Flexible/agile: improvements are easy to implement ✓ Prepared to withstand human failures ✓ At a reasonable cost
Main advantage: always recomputing everything from scratch
Future: Splout Key/value datastores have limitations
✓ They only accept querying by the key ✓ Aggregations are not possible ✓ In other words, we are forced to pre-compute everything ✓ Not always possible -> the data explodes ✓ For this particular case, time ranges are fixed
Splout: like Voldemort but SQL! ✓ The idea: replace Voldemort with Splout SQL ✓ Much richer queries: real-time aggregations, flexible time ranges ✓ It would allow creating some kind of Google Analytics for the statistics discussed in this presentation ✓ Open sourced!!!
https://github.com/datasalt/splout-db
Iván de Prado Alonso – CEO of Datasalt www.datasalt.es @ivanprado @datasalt
Questions?