big data real time architectures

Big Data Real Time Architectures: Lambda, Kappa motivation and practical applications

@dmarcous

Problems

Volume

Variety

Velocity

Solutions

Batch processing

NoSQL

Stream processing

More Problems?

● Machines FAIL
● Humans make mistakes
● We want everything in real time!
  ○ We can’t do everything in real time :(
● We might think of a new way to analyse old data
● We might want to take a look at older versions of the raw / aggregated data
● Looking at raw data is cool, looking at aggregated data is cooler, looking at indexed data with an ad-hoc filter is the coolest. What if we want them all on the same set of data?

● Batch processing
  ○ Large amount of static data
  ○ Scalable solution
  ○ Volume
● Real-time processing
  ○ Computing streaming data
  ○ Low latency
  ○ Velocity
● Hybrid computation
  ○ Lambda Architecture
  ○ Kappa Architecture

Big Data Timeline

● 2003 - Inception
● 2006 - 1st Generation
● 2010 - 2nd Generation
● 2012 - 3rd Generation

Lambda Architecture

● Nathan Marz (Twitter)
● How to beat the CAP theorem
  ○ http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
● Concepts:
  ○ Immutable data
  ○ Everything can be re-run
  ○ Using the best tool for the purpose
  ○ Query = Function(All Data)
  ○ Real time isn’t accurate, batch will fix any mistakes
● Layers
  ○ Batch
  ○ Speed
  ○ Serving




Lambda Architecture is a complementary pair of:
- in-memory real-time processing: fast, but small and volatile
- large HDD/SSD batch processing: slow, but large and persistent

Proposed by Nathan Marz
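A minimal sketch of the three layers, assuming a toy page-view counter. The names `master_log`, `batch_view`, `speed_view` and `query` are hypothetical and only illustrate "Query = Function(All Data)" split across a batch view and a real-time view; this is not code from any of the tools mentioned in these slides.

```python
from collections import Counter

# Immutable master dataset: an append-only list of (timestamp, page) events.
master_log = [
    (1, "home"), (2, "home"), (3, "about"),
    (4, "home"), (5, "about"),
]

def batch_view(events, up_to_ts):
    """Batch layer: recompute counts from scratch over all data up to a cutoff."""
    return Counter(page for ts, page in events if ts <= up_to_ts)

def speed_view(events, after_ts):
    """Speed layer: count only the events the last batch run has not seen yet."""
    view = Counter()
    for ts, page in events:
        if ts > after_ts:
            view[page] += 1
    return view

def query(page, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch[page] + speed[page]

batch_cutoff = 3  # assumed timestamp of the last completed batch run
batch = batch_view(master_log, batch_cutoff)
speed = speed_view(master_log, batch_cutoff)
home_views = query("home", batch, speed)
print(home_views)
```

When the next batch run completes, the cutoff moves forward and the speed view for the covered range can be thrown away, which is how batch "fixes" anything the speed layer got wrong.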

Lambda Pitfalls

● Data duplication
  ○ Columnar + Compressed
  ○ Don’t be cheap...
● Too many tools!
  ○ Stay on 1 platform - Hadoop/YARN
● Do I really need to write everything twice? (Cross DB ORM)
  ○ Frameworks
    ■ Twitter Summingbird (MR + Storm)
    ■ Apache Spark (batch / Streaming)
    ■ Google Dataflow
● No place for ad-hoc analysis
  ○ Add more specialised data sources
    ■ Solr / Elasticsearch
● Incremental algorithms are HARD - stream processing based on smart thresholds (= history)
  ○ Mix it up - key-value access during speed processing
● A new event may be related to an old one, which might be related to an older one…
  ○ Add graph processing (GraphX / Giraph / Titan)
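One way to avoid writing the logic twice is to express the aggregation once as "prepare + associative combine" and reuse it in both layers. The sketch below is plain Python under that assumption; `prepare`, `combine`, `batch_job` and `StreamJob` are hypothetical names, not the API of Summingbird, Spark or Dataflow.

```python
from functools import reduce

ZERO = (0, 0)  # identity element: (event count, total latency)

def prepare(event):
    """Map one raw event to a partial aggregate (count, total)."""
    return (1, event["latency_ms"])

def combine(a, b):
    """Associative merge of two partial aggregates."""
    return (a[0] + b[0], a[1] + b[1])

def batch_job(events):
    """Batch layer: fold the shared logic over the whole dataset."""
    return reduce(combine, map(prepare, events), ZERO)

class StreamJob:
    """Speed layer: apply the same logic one event at a time."""
    def __init__(self):
        self.state = ZERO

    def on_event(self, event):
        self.state = combine(self.state, prepare(event))
        return self.state

events = [{"latency_ms": 120}, {"latency_ms": 80}, {"latency_ms": 100}]

batch_result = batch_job(events)

stream = StreamJob()
for e in events:
    stream.on_event(e)
stream_result = stream.state
```

Because `combine` is associative, the batch fold and the per-event updates agree on the same input, so only `prepare` and `combine` need to be maintained.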

Kappa Architecture

● Jay Kreps (LinkedIn)
● Questioning The Lambda Architecture
  ○ http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
● Concepts
  ○ Retain the full log of the data
  ○ Processing = new instance of the same stream
  ○ Input - choose where to start reading from the log (now, 1 day ago, 1 year ago..)
  ○ Real time is accurate!
  ○ Re-processing only when code changes
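A rough sketch of the Kappa idea, assuming an in-memory stand-in for a retained log. `Log`, `read_from` and `run_job` are hypothetical names for illustration, not the Kafka or Samza API: re-processing is just a new job instance reading the same log from a chosen offset with the new code.

```python
class Log:
    """Hypothetical append-only log that retains every record."""
    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset=0):
        """Yield records starting at the given offset."""
        yield from self._records[offset:]

def run_job(log, process, start_offset=0):
    """One job instance: fold the stream into a view, starting at an offset."""
    view = {}
    for user, amount in log.read_from(start_offset):
        view[user] = view.get(user, 0) + process(amount)
    return view

log = Log()
for record in [("ann", 10), ("bob", 5), ("ann", 7)]:
    log.append(record)

# Version 1 of the code: sum the raw amounts.
view_v1 = run_job(log, lambda amount: amount)

# The code changes (amounts are now capped at 8): start a second instance
# from the beginning of the log, then switch serving to its view.
view_v2 = run_job(log, lambda amount: min(amount, 8))

# Starting later in the log is only a different offset.
recent = run_job(log, lambda amount: amount, start_offset=2)
```

Once the new instance has caught up, the old view and the old job can be dropped; there is no separate batch code path to keep in sync.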


Lambda Kappa

● Different
● Common
  ○ Greek letters
  ○ Real time processing at scale
  ○ Immutable Architectures
    ■ “Replay” possible
  ○ Born out of need
  ○ Both use Materialised views / indexed results for serving

Lambda Kappa

                        | Lambda                                      | Kappa
Processing Paradigm     | Batch + Streaming                           | Streaming
Re-processing Paradigm  | Every batch cycle                           | Only when code changes
Reliability             | Batch is reliable, streaming is approximate | Streaming with consistency (exactly once)
Resource Consumption    | Query = Function(All data)                  | Incremental algorithms, running on deltas
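"Streaming with consistency (exactly once)" usually comes down to not applying the same record twice. One common approach, sketched here as an assumption rather than as how any specific engine does it, is to store the last applied log offset together with the incremental state and to skip redelivered records. `CountingState` is a hypothetical name, and the sketch assumes offsets from a single, ordered partition.

```python
class CountingState:
    """Hypothetical state store: the view and the last applied offset live together."""
    def __init__(self):
        self.counts = {}
        self.last_offset = -1

    def apply(self, offset, key):
        """Apply one delta; return False if the record was already applied."""
        if offset <= self.last_offset:
            return False  # redelivery (e.g. after a crash and restart): ignore
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_offset = offset
        return True

state = CountingState()

# Offsets 1 and 0 are delivered twice, as can happen with at-least-once delivery.
deliveries = [(0, "a"), (1, "b"), (1, "b"), (2, "a"), (0, "a")]
applied = [state.apply(offset, key) for offset, key in deliveries]
```

The state only ever processes deltas, yet the resulting counts match what a full recomputation over the deduplicated log would give.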

Tooling

● Data Ingestion
  ○ Kafka
  ○ Apache Flume
  ○ Samza
● Batch
  ○ MR (Hive, Pig etc.)
  ○ Tez
  ○ Spark
  ○ Dataflow (= Google Flume)
● Stream
  ○ Storm
  ○ Spark Streaming
  ○ Samza
  ○ Dataflow (= Google Flume)
  ○ Flink
● Serving
  ○ DBs
    ■ ElephantDB
    ■ SploutSQL
    ■ HBase / Cassandra
  ○ Queries
    ■ Impala
    ■ Presto
    ■ Big Query

Users

● Lambdas
  ○ Twitter
  ○ Spotify (music recommendations)
  ○ Liveperson
  ○ Inneractive
● Kappas
  ○ LinkedIn
  ○ Yahoo
● Platforms
  ○ Oryx2 (Cloudera)
    ■ Lambda ML Platform using Kafka + Spark
  ○ Novelti.io (Previously Lambdoop)
    ■ Streaming intelligence for everything (mainly IoT)

More Architectures?

● Zeta Architecture
  ○ Includes cluster management
    ■ Monitoring
    ■ Scheduling
    ■ Container system etc.
  ○ Inspired by Google
● iot-a
  ○ Internet of Things
  ○ Layered
    ■ MQ (Kafka - RT)
    ■ DB (HBase - Interactive)
    ■ DFS (Batch)
● Mu Architecture
  ○ Lambda with only 1 set of aggregated views

Appendix - Videos

● Lambda
  ○ http://www.infoq.com/interviews/marz-lambda-architecture
● Kappa
  ○ http://www.kappa-architecture.com/
