Big & Fast: A quest for relevant and real-time analytics
DESCRIPTION
Now more than ever, the retail banking market demands that we stay close to our customers and carefully understand which services, products, and wishes are relevant to each customer at any given time. This sort of marketing research is often beyond the capacity of traditional BI reporting frameworks. In this talk, we illustrate how we team up data scientists and big data engineers to create and scale distributed analyses on a big data platform.
TRANSCRIPT
Big & Fast: A quest for relevant and real-time analytics
Natalino Busa, @natalinobusa
Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing
www.natalinobusa.com
Big and Fast: Methodology, Architecture, Roles and organization
Conversion is the ultimate form of permission marketing
Permission marketing is about the honour of being heard.
How to earn it? Provide the right suggestions at the right time. This is what makes data analysis valuable.
When do you really know your customer?
When you know about their last unique:
5 songs?
100 songs?
10,000 songs?
Old & new stuff.
We evolve slowly: our personality, our habits.
But events and trends can affect us at short notice.
How do you combine old with new?
The customer’s context
Complex on many dimensions:
Personal history: amount of transactions ever done
Long-term interaction: how the users’ actions correlate with others
Real-time events: trends and recent events
The customer’s context
Context is related to time:
Slow changing: the defining characteristics of a person
Fast changing: events which influence our lives, trends
These require very different technology solutions!
Challenges
millions of billions of
Not much time to react: the window of opportunity is sometimes just a few seconds
Loads of information to process: you want to understand the user history well
Slow and fast
ranking and preference analysis
segmentation and clustering
short term trending topics
rule-based recommendations
10s of terabytes of data. This can take hours…
100s of events per second. This must be fast…
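The "short term trending topics" item above can be sketched as a sliding-window counter. This is a minimal illustration, not the implementation used in the talk: the class name `TrendingTopics`, the 60-second window and the sample topics are all assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import Counter, deque


class TrendingTopics:
    """Counts topic occurrences inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, topic), oldest first
        self.counts = Counter()  # topic -> occurrences inside the window

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_topic = self.events.popleft()
            self.counts[old_topic] -= 1
            if self.counts[old_topic] == 0:
                del self.counts[old_topic]

    def add(self, timestamp: float, topic: str) -> None:
        # Assumes timestamps are non-decreasing.
        self._evict(timestamp)
        self.events.append((timestamp, topic))
        self.counts[topic] += 1

    def top(self, now: float, n: int = 3):
        self._evict(now)
        return self.counts.most_common(n)


# Made-up sample events: (seconds, topic)
trends = TrendingTopics(window_seconds=60)
for ts, topic in [(0, "mortgage"), (10, "savings"), (50, "savings"),
                  (60, "savings"), (70, "travel")]:
    trends.add(ts, topic)

top_now = trends.top(now=75)
```

Each event is touched once on arrival and once on eviction, so the per-event cost stays small, which is the property the "must be fast" side needs.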
Hadoop: Distributed Data OS
Reliable: distributed, replicated file system
Low cost: ↓ cost vs ↑ performance/storage
Computing powerhouse: all the cluster's CPUs work in parallel to run queries
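The idea of many CPUs working in parallel on one query can be sketched as a map step per data partition followed by a reduce step. This is a single-machine illustration only: threads stand in for cluster nodes, the function names are hypothetical, and the records are made up. It does not use the Hadoop API.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Made-up transaction records, split into partitions the way a
# distributed file system splits a large file into blocks.
partitions = [
    [("alice", 20.0), ("bob", 5.0)],
    [("alice", 30.0), ("carol", 12.5)],
    [("bob", 7.5), ("carol", 2.5)],
]


def map_partition(records):
    """Map step: pre-aggregate spend per customer within one partition."""
    partial = defaultdict(float)
    for customer, amount in records:
        partial[customer] += amount
    return dict(partial)


def reduce_partials(partials):
    """Reduce step: merge the per-partition results into one total."""
    total = defaultdict(float)
    for partial in partials:
        for customer, amount in partial.items():
            total[customer] += amount
    return dict(total)


# Each partition is processed independently; threads stand in for nodes here.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partitions))

spend_per_customer = reduce_partials(partials)
```

Because each partition is aggregated independently, adding machines lets more partitions be processed at the same time.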
Scala / Akka / Spray: a WEB API reactive framework
[Diagram: Actor A, Actor B and Actor C exchanging messages msg 1 to msg 4]
● it scales horizontally (can run in cluster mode)
● maximum use of the available cores/memory
1. processing is non-blocking, threads are re-used
2. can parallelize computing power across many actors
Very fast: 1000’s messages/sec
Very reliable: auto recovery
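The actor model above can be sketched in a few lines: an actor owns its state and a mailbox, and processes one message at a time without blocking a thread while it waits. This is a sketch using Python's `asyncio`, not Akka. The class `CounterActor`, its `tell` method and the event names are assumptions made for illustration.

```python
import asyncio


class CounterActor:
    """An actor owns its state and handles one message at a time (hypothetical example)."""

    def __init__(self):
        self.mailbox = asyncio.Queue()
        self.counts = {}

    async def run(self):
        while True:
            # Waiting on the mailbox yields to the event loop instead of blocking a thread.
            message = await self.mailbox.get()
            if message is None:  # stop signal
                break
            self.counts[message] = self.counts.get(message, 0) + 1

    async def tell(self, message):
        # Senders only enqueue; they never touch the actor's state directly.
        await self.mailbox.put(message)


async def main():
    actor = CounterActor()
    worker = asyncio.create_task(actor.run())
    for event in ["login", "payment", "login"]:
        await actor.tell(event)
    await actor.tell(None)  # ask the actor to stop
    await worker
    return actor.counts


counts = asyncio.run(main())
```

Since only the actor itself mutates `counts`, no locks are needed, which is what makes it straightforward to run many actors side by side.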
Distributed computing: lambda architecture
[Diagram: Lambda Architecture, Batch + Streaming]
Batch computing
Streaming computing
In-memory distributed database
In-memory distributed DBs
HTTP RESTful API
Low-latency Web API services
Data warehouses, messaging buses
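A lambda architecture answers queries by combining a batch view (complete but stale) with a speed view (fresh but partial). The following is a minimal in-process sketch under that assumption. The names `batch_view`, `speed_view`, `on_event` and `query` are hypothetical, and the counts are made up.

```python
from collections import Counter

# Batch view: recomputed periodically over the full history (slow, complete).
batch_view = Counter({"savings": 120, "mortgage": 45})

# Speed view: updated per event since the last batch run (fast, partial).
speed_view = Counter()


def on_event(product: str) -> None:
    """Streaming path: update the speed view as each event arrives."""
    speed_view[product] += 1


def query(product: str) -> int:
    """Serving path: merge both views at read time."""
    return batch_view[product] + speed_view[product]


for product in ["savings", "travel", "savings"]:
    on_event(product)

savings_total = query("savings")
travel_total = query("travel")
```

When the next batch run completes, the batch view would absorb those events and the speed view would be reset. That step is left out here for brevity.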
Distributed computing: some techs
Hadoop
Cassandra
millions of billions of
λ = conversions
(lambda)
All Things Distributed
Distributing computing and storage
more machines = more storage/computing
Open Source software solutions
mature enough for pragmatic adopters
Near real-time + big data technologies
Hadoop, Scala, Akka, Spray, Cassandra
Science & Engineering
Statistics, Data Science
Python, R, Visualization
IT Infra, Big Data
Java, Scala, SQL
Hadoop: Big Data Infrastructure, Data Science on large datasets
Big Data and Fast Data require different profiles to achieve the best results
Thanks! Any questions?