webinar - big data: let's smack - jorg schad
TRANSCRIPT
Big Data: Let's SMACK @joerg_schad
You can come and see me at Codemotion Milan 2017!
Talk: NO ONE PUTS Java IN THE
CONTAINER November 10th, 15:10-15.50
http://milan2017.codemotionworld.com/
© 2017 Mesosphere, Inc. All Rights Reserved. 3
Jörg SchadSoftware Engineer @Mesosphere
@joerg_schad
@joerg.mesosphere
© 2017 Mesosphere, Inc. All Rights Reserved. 4
MapReduce is crunching Data
Ancient Times...
© 2016 Mesosphere, Inc. All Rights Reserved. 5
But then business demanded FAST DATA
We need to turn faster!
Today...
Batch Event ProcessingMicro-Batch
Days Hours Minutes Seconds Microseconds
Solves problems using predictive and prescriptive analyticsReports what has happened using descriptive analytics
Predictive User InterfaceReal-time Pricing and Routing Real-time AdvertisingBilling, Chargeback Product recommendations
(Fast) Data Processing
Fast Data Pipeline
EVENTS
Ubiquitous data streams from connected devices
INGEST STOREANALYZE ACT
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
The SMACK Stack
EVENTS
Ubiquitous data streams from connected devices
INGEST
Apache Kafka
STORE
Apache Spark
ANALYZE
Apache Cassandra
ACT
Akka
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
Apache Mesos
Sensors
Devices
Clients
© 2017 Mesosphere, Inc. All Rights Reserved. 9
Datacenter
NAIVE APPROACH
Typical Datacentersiloed, over-provisioned servers,
low utilization
Industry Average 12-15% utilization
mySQL
microservice
Cassandra
Spark/Hadoop
Kafka
© 2017 Mesosphere, Inc. All Rights Reserved. 11
MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS
Typical Datacentersiloed, over-provisioned servers,
low utilization
Mesos/ DC/OSautomated schedulers, workload multiplexing onto the
same machines
mySQL
microservice
Cassandra
Spark/Hadoop
Kafka
Apache Mesos• A top-level Apache project• A cluster resource negotiator• Scalable to 10,000s of nodes• Fault-tolerant, battle-tested• An SDK for distributed apps• Native Docker support
MESOS: FUNDAMENTAL ARCHITECTURE
Mesos Master
Mesos Master
Mesos Master
Mesos AgentMesos Agent ServiceCassandra Executor
Cassandra Task
Cassandra Scheduler
Container Scheduler
Spark Scheduler
Spark Executor
Spark Task
Mesos AgentMesos Agent ServiceDocker
Executor
Docker Task
Spark Executor
Spark Task
Two-level Scheduling1. Agents advertise resources to Master2. Master offers resources to Framework3. Framework rejects / uses resources4. Agent reports task status to Master
© 2017 Mesosphere, Inc. All Rights Reserved. 15
Datacenter Operating System (DC/OS)
Distributed Systems Kernel (Mesos)
DC/OS ENABLES MODERN DISTRIBUTED APPS
Big Data + Analytics EnginesMicroservices (in containers)
Streaming
Batch
Machine Learning
Analytics
Functions & Logic
Search
Time Series
SQL / NoSQL
Databases
Modern App Components
Any Infrastructure (Physical, Virtual, Cloud)
The SMACK Stack
EVENTS
Ubiquitous data streams from connected devices
INGEST
Apache Kafka
STORE
Apache Spark
ANALYZE
Apache Cassandra
ACT
Akka
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
Apache Mesos
Sensors
Devices
Clients
The SMACK Stack
EVENTS
Ubiquitous data streams from connected devices
INGEST STOREANALYZE ACT
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
The SMACK Stack
EVENTS
Ubiquitous data streams from connected devices
INGEST STOREANALYZE ACT
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
DATA PROCESSING AT HYPERSCALE
EVENTS
Ubiquitous data streams from connected devices
INGEST
Apache Kafka
STORE
Apache Spark
ANALYZE
Apache Cassandra
ACT
Akka
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
DC/OS
Sensors
Devices
Clients
MESSAGE QUEUES
Apache Kafka ØMQ, RabbitMQ, Disque (Redis-based), etc. fluentd, Logstash, Flume Akka streams cloud-only:
AWS SQS Google Cloud Pub/Sub
APACHE KAFKA
High-throughput, distributed, persistent publish-subscribe messaging system
Originates from LinkedIn Typically used as buffer/de-coupling
layer in online stream processing
fluentd
© 2016 Mesosphere, Inc. All Rights Reserved. 24
● Scalability● Message Type● Log vs …
● Delivery Guarantees/Message durability
● Routing Capabilities● Failover● Community● Mesos Support ;-)
HOW TO CHOOSE?
DELIVERY GUARANTEES
At most once—Messages may be lost but are never redelivered.
At least once—Messages are never lost but may be redelivered.
Exactly once—this is what people actually want, each message is delivered once and only once.
Murphy’s Law of Distributed Systems:
Anything that can go wrong, will go wrong … partially!
RoutingSimple Pipes Routing
DATA PROCESSING AT HYPERSCALE
EVENTS
Ubiquitous data streams from connected devices
INGEST
Apache Kafka
STORE
Apache Spark
ANALYZE
Apache Cassandra
ACT
Akka
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
DC/OS
Sensors
Devices
Clients
STREAM PROCESSING• Apache Storm • Apache Spark • Apache Samza • Apache Flink • Apache Apex • Concord • cloud-only: AWS Kinesis,
Google Cloud Dataflow
© 2016 Mesosphere, Inc. All Rights Reserved. 29
APACHE SPARK
APACHE SPARK (STREAMING)
Typical Use: distributed, large-scale data processing; micro-batchingWhy Spark Streaming?
• Micro-batching creates very low latency, which can be faster
• Well defined role means it fits in well with other pieces of the pipeline
© 2016 Mesosphere, Inc. All Rights Reserved. 31
● Execution Model● Native Streaming vs Microbatch
● Fault Tolerance Granularity● Per record, per batch
● Delivery Guarantees● API
● SQL● Spark
● Performance….● Realtime ≠ Realtime
● Community● Mesos Support ;-)
HOW TO CHOOSE?
EXECUTION MODELMicro-Batching Native
Streaming
FAULT TOLERANCECheckpoint per “Batch”Ack-Per-Record Checkpoint per Batch
DELIVERY GUARANTEES“Exactly once”At least Once
DATA PROCESSING AT HYPERSCALE
EVENTS
Ubiquitous data streams from connected devices
INGEST
Apache Kafka
STORE
Apache Spark
ANALYZE
Apache Cassandra
ACT
Akka
Ingest millions of events per second
Distributed & highly scalable database
Real-time and batch process data
Visualize data and build data driven
applications
DC/OS
Sensors
Devices
Clients
Datastores
Data ModelKey-Value GraphRelational Document
● Schema
● SQL
● Foreign Keys/Joins
● OLTP/OLAP
● Simple
● Scalable
● Cache
FilesTime-Series● Complex
relations
● Social Graph
● Recommendation
● Fraud detections
● Schema-Less
● Semi-structured queries
● Product catalogue
● Session data
Demo Time
Generator Display
1. Financial data created by generator
2. Written to Kafka topics
3. Kafka Topics consumed by Flink
4. Flink pipeline operates on Kafka data
5. Results written back into Kafka stream (another topic)
6. Results displayed
© 2017 Mesosphere, Inc. All Rights Reserved. 39
Keep it running!
SERVICE OPERATIONS
● Configuration Updates (ex: Scaling, re-configuration) ● Binary Upgrades ● Cluster Maintenance (ex: Backup, Restore, Restart) ● Monitor progress of operations ● Debug any runtime blockages
CODEMOTION MILAN November 10/11th,2017http://milan2017.codemotionworld.com/
See you at