Big Data Real-Time Architectures
TRANSCRIPT
More Problems?
● Machines FAIL
● Humans make mistakes
● We want everything in real time!
○ We can’t do everything in real time :(
● We might think of a new way to analyse old data
● We might want to take a look at older versions of the raw / aggregated data
● Looking at raw data is cool, looking at aggregated data is cooler, looking at indexed data with ad-hoc filters is the coolest. What if we want them all on the same set of data?
● Batch processing
○ Large amount of static data
○ Scalable solution
○ Volume
● Real-time processing
○ Computing streaming data
○ Low latency
○ Velocity
● Hybrid computation
○ Lambda Architecture
○ Kappa Architecture
Big Data Timeline
[Figure: timeline marking 2003 (Inception), 2006, 2010 and 2012, with labels for the 1st, 2nd and 3rd Generation]
● Nathan Marz (Twitter)
● How to beat the CAP theorem
○ http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
Lambda Architecture
● Concepts:
○ Immutable data
○ Everything can be re-run
○ Using the best tool for the purpose
○ Query = Function(All Data)
○ Real time isn’t accurate, batch will fix any mistakes
● Layers
○ Batch
○ Speed
○ Serving
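The three layers can be sketched in a few lines of Python. This is only an illustration, not anything shown in the talk: the page-view counting use case and the names `master_log`, `batch_view`, `SpeedLayer` and `query` are all assumptions made up for the example.

```python
from collections import Counter

# Immutable master dataset: events are only ever appended, never updated.
master_log = []

def append_event(event):
    master_log.append(event)

def batch_view(events):
    """Batch layer: recompute page-view counts from scratch over all given events."""
    return Counter(e["page"] for e in events)

class SpeedLayer:
    """Speed layer: incremental counts for events the last batch run has not covered."""
    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        self.counts[event["page"]] += 1

    def reset(self):
        # Called once a batch run has absorbed the events counted here.
        self.counts.clear()

def query(page, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch[page] + speed.counts[page]

speed = SpeedLayer()
for page in ["home", "home", "about"]:
    event = {"page": page}
    append_event(event)
    speed.update(event)

batch = batch_view(master_log)  # batch run covers everything so far
speed.reset()                   # speed view now only needs newer events

late = {"page": "home"}         # arrives after the batch run
append_event(late)
speed.update(late)

print(query("home", batch, speed))  # 3 (2 from batch + 1 from speed)
```

Because the master log is immutable, a bug in either layer can be fixed by re-running `batch_view` over the full log.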
Lambda Architecture is:
A complementary pair of:
- in-memory real-time processing: fast, but small and volatile
- large HDD/SSD batch processing: slow, but large and persistent
Proposed by Nathan Marz
● Data duplication
○ Columnar + Compressed
○ Don’t be cheap...
● Too many tools!
○ Stay on 1 platform - Hadoop/YARN
● Do I really need to write everything twice? (Cross DB ORM)
○ Frameworks
■ Twitter Summingbird (MR + Storm)
■ Apache Spark (batch / Streaming)
■ Google Dataflow
● No place for ad-hoc analysis
○ Add more specialised data sources
■ Solr / Elasticsearch
● Incremental Algorithms are HARD - stream process based on smart thresholds (= history)
○ Mix it up - Key value access during speed process
● A new event may be related to an old one, that might be related to an older one…
○ Add graph processing (GraphX / Giraph / Titan)
Lambda Pitfalls
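One way to read the "smart thresholds (= history)" point: the batch side derives a threshold per key from historical data, and the speed side does a single key-value lookup per event. The sketch below assumes a "mean + 3 standard deviations" rule and hypothetical names (`build_thresholds`, `speed_process`, `sensor-1`); none of these come from the slides.

```python
from statistics import mean, pstdev

def build_thresholds(history):
    """Batch step: derive a per-key alert threshold from historical values.

    The mean + 3 * stddev rule is an assumption chosen for illustration.
    """
    return {key: mean(vals) + 3 * pstdev(vals) for key, vals in history.items()}

def speed_process(event, thresholds):
    """Speed step: one key-value lookup per event, no scan of the history."""
    limit = thresholds.get(event["key"])
    return limit is not None and event["value"] > limit

history = {"sensor-1": [10, 12, 11, 9, 13]}
thresholds = build_thresholds(history)

print(speed_process({"key": "sensor-1", "value": 20}, thresholds))  # True
print(speed_process({"key": "sensor-1", "value": 12}, thresholds))  # False
```

In a real deployment the `thresholds` dict would live in a key-value store that the batch job refreshes, so the stream processor never has to hold the history itself.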
● Jay Kreps (LinkedIn)
● Questioning The Lambda Architecture
○ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
● Concepts
○ Retain the full log of the data
○ Processing = new instance of the same stream
○ Input - choose where to start reading from the log (now, 1 day ago, 1 year ago...)
○ Real time is accurate!
○ Re-processing only when code changes
Kappa Architecture
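A minimal sketch of the Kappa idea, assuming an in-memory list as a stand-in for a retained log such as a Kafka topic. The `Log` class, `run_job`, and the two job versions are hypothetical names for illustration. A real stream job would keep consuming new records; here each run simply reads the records present when it starts.

```python
class Log:
    """Append-only log that retains every record (stand-in for a Kafka topic)."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        return iter(self._records[offset:])

def run_job(log, process, start_offset=0):
    """Start a new instance of a stream job that builds its own output view."""
    view = {}
    for record in log.read_from(start_offset):
        process(view, record)
    return view

def count_v1(view, record):
    # Version 1 of the code: count events per user.
    view[record["user"]] = view.get(record["user"], 0) + 1

def count_v2(view, record):
    # Version 2: same count, but events flagged as bots are ignored.
    if not record.get("bot", False):
        view[record["user"]] = view.get(record["user"], 0) + 1

log = Log()
log.append({"user": "ann"})
log.append({"user": "bob", "bot": True})
log.append({"user": "ann"})

view_v1 = run_job(log, count_v1)                  # original job
view_v2 = run_job(log, count_v2)                  # code changed: replay from offset 0
recent = run_job(log, count_v2, start_offset=2)   # or start reading later in the log

print(view_v1)  # {'ann': 2, 'bob': 1}
print(view_v2)  # {'ann': 2}
print(recent)   # {'ann': 1}
```

Once `view_v2` has caught up, the serving side can switch to it and the old view can be dropped; there is no separate batch code path to maintain.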
● Different
● Common
○ Greek letters
○ Real time processing at scale
○ Immutable Architectures
■ “Replay” possible
○ Born out of need
○ Both use Materialised views / indexed results for serving
Lambda vs. Kappa
● Processing Paradigm
○ Lambda: Batch + Streaming
○ Kappa: Streaming
● Re-processing Paradigm
○ Lambda: Every Batch Cycle
○ Kappa: Only when code changes
● Reliability
○ Lambda: Batch is reliable, streaming is approximate
○ Kappa: Streaming with consistency (exactly once)
● Resource Consumption
○ Lambda: Query = Function(All data)
○ Kappa: Incremental algorithms, running on deltas
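The resource-consumption difference can be illustrated with a running mean. This is a sketch under assumed names (`mean_full`, `IncrementalMean`): the full version re-reads all data on every run, while the incremental version keeps a small state and only folds in the delta.

```python
def mean_full(all_values):
    """Full recomputation: touches every value on each run."""
    return sum(all_values) / len(all_values)

class IncrementalMean:
    """Incremental version: keeps (count, total) and folds in each delta."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def apply_delta(self, new_values):
        for v in new_values:
            self.count += 1
            self.total += v
        # No data seen yet means there is no mean to report.
        return self.total / self.count if self.count else None

history = [2.0, 4.0]
delta = [6.0]

inc = IncrementalMean()
print(inc.apply_delta(history))    # 3.0
print(inc.apply_delta(delta))      # 4.0, computed from the delta only
print(mean_full(history + delta))  # 4.0, computed from all data
```

A mean is easy to maintain incrementally; many other aggregates (medians, distinct counts) are not.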
● Data Ingestion
○ Kafka
○ Apache Flume
○ Samza
● Batch
○ MR (Hive, Pig etc.)
○ Tez
○ Spark
○ Dataflow (=Google Flume)
● Stream
○ Storm
○ Spark Streaming
○ Samza
○ Dataflow (=Google Flume)
○ Flink
Tooling
● Serving
○ DBs
■ ElephantDB
■ SploutSQL
■ HBase / Cassandra
○ Queries
■ Impala
■ Presto
■ BigQuery
● Lambdas
○ Twitter
○ Spotify (music recommendations)
○ Liveperson
○ Inneractive
● Kappas
○ LinkedIn
○ Yahoo
● Platforms
○ Oryx2 (Cloudera)
■ Lambda ML Platform using Kafka + Spark
○ Novelti.io (Previously Lambdoop)
■ Streaming intelligence for everything (mainly IoT)
Users
● Zeta Architecture
○ Includes cluster management
■ Monitoring
■ Scheduling
■ Container system etc.
○ Inspired by Google
● IoT-a
○ Internet of Things
○ Layered
■ MQ (Kafka - RT)
■ DB (HBase - Interactive)
■ DFS (Batch)
● Mu Architecture
○ Lambda with only 1 set of aggregated views
More Architectures?