Post on 05-Dec-2014
Big & Fast: A quest for relevant and real-time analytics
Natalino Busa @natalinobusa
Parallelism, Mathematics, Programming Languages, Machine Learning, Statistics, Big Data, Algorithms, Cloud Computing
www.natalinobusa.com
Big and Fast: Methodology, Architecture, Roles and organization
Conversion is the ultimate form of permission marketing
Permission marketing is about the honour of being heard.
How to earn it? Provide the right suggestions at the right time. This is what makes data analysis valuable.
When do you really know your customer?
When you know about their last unique:
5 songs?
100 songs?
10,000 songs?
Old & New stuff.
We evolve slowly, our personality, our habits.
But events and trends can affect us at short notice.
How do you combine old with new?
The customer’s context
Complex on many dimensions:
Personal history: number of transactions ever done
Long-term interaction: how the user’s actions correlate with those of others
Real-time events: trends and recent events
The customer’s context
Context is related to time:
slow changing: the defining characteristics of a person
fast changing: events which influence our lives, trends
These require very different technology solutions!
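One possible way to combine the slow-changing and fast-changing context is to blend a long-term profile score with a recent-event score whose weight decays as the event ages. The sketch below is only an illustration of that idea, not the method used in the talk; `blend_scores`, the half-life, and the cap on the recent weight are all assumptions made for the example.

```python
def blend_scores(long_term, recent, recent_age_s,
                 half_life_s=3600.0, max_recent_weight=0.5):
    """Blend a slow-changing profile with a fast-changing signal.

    The weight of the recent signal starts at `max_recent_weight` and halves
    every `half_life_s` seconds, so fresh events matter briefly and the
    long-term profile takes over again afterwards.
    """
    w = max_recent_weight * 0.5 ** (recent_age_s / half_life_s)
    items = set(long_term) | set(recent)
    return {
        item: (1.0 - w) * long_term.get(item, 0.0) + w * recent.get(item, 0.0)
        for item in items
    }

# Hypothetical scores: the profile comes from the full history,
# the trend from events seen in the last few minutes.
profile = {"jazz": 0.8, "rock": 0.2}
trending = {"rock": 1.0}

fresh = blend_scores(profile, trending, recent_age_s=0)       # trend just happened
stale = blend_scores(profile, trending, recent_age_s=36000)   # ten hours later
```

With these made-up numbers the fresh blend ranks "rock" above "jazz", while the stale blend falls back to the long-term preference for "jazz".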
Challenges
millions of billions of
Not much time to react: the window of opportunity is sometimes just a few seconds
Loads of information to process: you want to understand the user history well
Slow and fast
ranking and preference analysis
segmentation and clustering
short term trending topics
rule-based recommendations
Tens of terabytes of data. This can take hours ...
Hundreds of events per second. This must be fast ...
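As an illustration of the fast side, short-term trending topics can be tracked with a sliding time window over the event stream. This is a minimal sketch under assumptions: the class name `TrendingWindow`, the 60-second window, and the sample events are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque

class TrendingWindow:
    """Counts topics seen during the last `window_s` seconds."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.events = deque()      # (timestamp, topic), oldest first
        self.counts = Counter()

    def add(self, topic, ts):
        self._expire(ts)
        self.events.append((ts, topic))
        self.counts[topic] += 1

    def top(self, n, now):
        self._expire(now)
        return self.counts.most_common(n)

    def _expire(self, now):
        # Drop events that fell out of the window; each event is removed once.
        while self.events and self.events[0][0] <= now - self.window_s:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

window = TrendingWindow(window_s=60.0)
for ts, topic in [(0, "a"), (10, "b"), (20, "b"), (70, "c")]:
    window.add(topic, ts)

top_now = window.top(2, now=75)   # only events newer than t=15 remain
```

Each update touches only the events entering or leaving the window, which is what keeps this kind of computation cheap enough for a per-event path.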
Hadoop: Distributed Data OS
Reliable: distributed, replicated file system
Low cost: ↓ cost vs ↑ performance/storage
Computing powerhouse: all the cluster’s CPUs work in parallel to run queries
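The idea of many CPUs working on one query can be sketched as a map step per data partition followed by a reduce step that merges the partial results. The example below is plain Python rather than Hadoop code: threads stand in for cluster nodes, and the "user,song" log format and all names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_partition(lines):
    """Map step: count plays per song within one partition of the log."""
    return Counter(line.split(",")[1] for line in lines)

def merge(a, b):
    """Reduce step: merge two partial counts."""
    return a + b

# Hypothetical play log as "user,song" records, already split into partitions,
# similar to how a distributed file system splits a large file into blocks.
partitions = [
    ["u1,songA", "u2,songB"],
    ["u1,songA", "u3,songC"],
    ["u2,songA"],
]

# Each partition is processed independently, so the work can run in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_partition, partitions))

play_counts = reduce(merge, partials, Counter())
```

Because the map step never looks outside its own partition, adding machines (here, workers) lets the same query cover more data in roughly the same time.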
Scala / Akka / Spray: a WEB API reactive framework
[Diagram: Actor A, Actor B, and Actor C exchanging messages msg 1 to msg 4]
● it scales horizontally (can run in cluster mode)
● maximum use of the available cores/memory
1. processing is non-blocking, threads are re-used
2. can parallelize computing power across many actors
Very fast: thousands of messages/sec
Very reliable: auto recovery
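The actor model behind these points can be sketched in a few lines: each actor owns its state and a mailbox, and handles one message at a time without blocking a thread while it waits. The sketch below uses Python's `asyncio` to show the pattern; it is not Akka's API, and the `Actor` class, the `tell` method name, and the `None` stop signal are assumptions of this example.

```python
import asyncio

class Actor:
    """A minimal actor: private state plus a mailbox, one message at a time."""

    def __init__(self, name):
        self.name = name
        self.mailbox = asyncio.Queue()
        self.received = []

    def tell(self, msg):
        # Fire-and-forget send: the sender never waits for the receiver.
        self.mailbox.put_nowait(msg)

    async def run(self):
        while True:
            msg = await self.mailbox.get()   # frees the thread while idle
            if msg is None:                  # None is this sketch's stop signal
                break
            self.received.append(msg)

async def main():
    a, b = Actor("A"), Actor("B")
    tasks = [asyncio.create_task(actor.run()) for actor in (a, b)]
    a.tell("msg 1")
    b.tell("msg 2")
    a.tell("msg 3")
    a.tell(None)
    b.tell(None)
    await asyncio.gather(*tasks)
    return a.received, b.received

received_a, received_b = asyncio.run(main())
```

Both actors share a single thread here, which mirrors the "threads are re-used" point: waiting for a message costs nothing, so many actors can be multiplexed over few threads.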
Distributed computing: lambda architecture
Batch Computing
HTTP RESTful API
In-Memory Distributed Database
In-memory Distributed DBs
Lambda Architecture: Batch + Streaming
Low-latency Web API services
Streaming Computing
Data Warehouses, Messaging Buses
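A rough sketch of how the batch and streaming paths can meet in the serving layer: the batch job periodically publishes a complete view, the streaming path keeps a small delta of newer events, and a query merges the two. The class and method names are hypothetical, and the sketch assumes each batch run covers every event received before it is loaded.

```python
from collections import Counter

class LambdaViews:
    """Serving-layer sketch: a precomputed batch view plus a real-time delta."""

    def __init__(self):
        self.batch_view = Counter()   # recomputed from all data, e.g. nightly
        self.speed_view = Counter()   # events seen since the last batch run

    def on_event(self, key):
        # Streaming path: cheap incremental update, visible immediately.
        self.speed_view[key] += 1

    def load_batch(self, counts):
        # Batch path: replace the view and drop the deltas it now covers.
        self.batch_view = Counter(counts)
        self.speed_view.clear()

    def query(self, key):
        # Low-latency read: merge both views at query time.
        return self.batch_view[key] + self.speed_view[key]

views = LambdaViews()
views.load_batch({"songA": 1000})   # result of a long-running batch job
views.on_event("songA")             # events arriving afterwards
views.on_event("songB")

total_a = views.query("songA")
total_b = views.query("songB")
```

In a real deployment the two views would live in an in-memory distributed database and the `query` method would sit behind the HTTP API; the dictionaries here only stand in for those components.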
Distributed computing: some techs
Hadoop
Cassandra
millions of billions of
λ = conversions
(lambda)
All Things Distributed
Distributing computing and storage
more machines = more storage/computing
Open Source software solutions
mature enough for pragmatic adopters
Near realtime + big data technologies
Hadoop, Scala, Akka, Spray, Cassandra
Science & Engineering
Statistics, Data Science
Python, R, Visualization
IT Infra, Big Data
Java, Scala, SQL
Hadoop: Big Data Infrastructure, Data Science on large datasets
Big Data and Fast Data require different profiles to achieve the best results
Thanks! Any questions?