webinar - big data: let's smack - jorg schad

Big Data: Let's SMACK @joerg_schad

You can come and see me at Codemotion Milan 2017!

Talk: NO ONE PUTS Java IN THE CONTAINER

November 10th, 15:10-15:50

http://milan2017.codemotionworld.com/


© 2017 Mesosphere, Inc. All Rights Reserved.

Jörg Schad

Software Engineer @Mesosphere

@joerg_schad

@joerg.mesosphere


MapReduce is crunching Data

Ancient Times...


But then business demanded FAST DATA

We need to turn faster!

Today...

Processing models: Batch, Micro-Batch, Event Processing

Latency scale: Days, Hours, Minutes, Seconds, Microseconds

Slower end: Reports what has happened using descriptive analytics

Faster end: Solves problems using predictive and prescriptive analytics

Example use cases: Billing, Chargeback; Product recommendations; Real-time Pricing and Routing; Real-time Advertising; Predictive User Interface

(Fast) Data Processing

Fast Data Pipeline

EVENTS: Ubiquitous data streams from connected devices

INGEST: Ingest millions of events per second

STORE: Distributed & highly scalable database

ANALYZE: Real-time and batch process data

ACT: Visualize data and build data-driven applications

The SMACK Stack

EVENTS: Sensors, Devices, Clients

INGEST: Apache Kafka

STORE: Apache Cassandra

ANALYZE: Apache Spark

ACT: Akka (applications)

All running on Apache Mesos


Datacenter

NAIVE APPROACH

Typical Datacenter: siloed, over-provisioned servers, low utilization

Industry Average 12-15% utilization

Silos: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka

MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS

Typical Datacenter: siloed, over-provisioned servers, low utilization

Mesos / DC/OS: automated schedulers, workload multiplexing onto the same machines

Workloads: MySQL, microservice, Cassandra, Spark/Hadoop, Kafka
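To make the utilization argument concrete, here is a back-of-the-envelope sketch in Python. All numbers (per-workload average and peak cores, server size, combined peak) are made up for illustration and are not measurements from the talk; the assumption is only that workload peaks do not all coincide.

```python
import math

# Hypothetical CPU demand per workload in cores: (average, peak). Illustrative only.
WORKLOADS = {
    "MySQL": (2, 12),
    "microservice": (1, 8),
    "Cassandra": (3, 14),
    "Spark/Hadoop": (2, 16),
    "Kafka": (2, 10),
}
CORES_PER_SERVER = 16  # hypothetical server size


def siloed_servers(workloads, cores_per_server):
    """Each workload gets its own servers, sized for its own peak."""
    return sum(math.ceil(peak / cores_per_server) for _, peak in workloads.values())


def shared_servers(combined_peak, cores_per_server):
    """One shared pool, sized for the combined peak of all workloads."""
    return math.ceil(combined_peak / cores_per_server)


def utilization(workloads, servers, cores_per_server):
    """Average demand divided by provisioned capacity."""
    average_total = sum(average for average, _ in workloads.values())
    return average_total / (servers * cores_per_server)


silo_count = siloed_servers(WORKLOADS, CORES_PER_SERVER)
# Assumption: peaks rarely coincide, so the combined peak (30 cores here)
# is well below the sum of the individual peaks (60 cores).
shared_count = shared_servers(30, CORES_PER_SERVER)

print(silo_count, utilization(WORKLOADS, silo_count, CORES_PER_SERVER))
print(shared_count, utilization(WORKLOADS, shared_count, CORES_PER_SERVER))
```

With these invented numbers the siloed layout needs five servers and the shared pool two; the point is the direction of the effect, not the exact figures.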

Apache Mesos

• A top-level Apache project
• A cluster resource negotiator
• Scalable to 10,000s of nodes
• Fault-tolerant, battle-tested
• An SDK for distributed apps
• Native Docker support

MESOS: FUNDAMENTAL ARCHITECTURE

[Diagram: three Mesos Masters; framework schedulers (Cassandra Scheduler, Container Scheduler, Spark Scheduler); two Mesos Agents, each running a Mesos Agent Service with executors and their tasks (Cassandra Executor / Cassandra Task, Docker Executor / Docker Task, Spark Executor / Spark Task)]

Two-level Scheduling

1. Agents advertise resources to Master
2. Master offers resources to Framework
3. Framework rejects / uses resources
4. Agent reports task status to Master
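The four steps of two-level scheduling can be sketched as a toy simulation. This is plain Python, not the Mesos API: the class names `Offer`, `Framework`, and `Master`, the method names, and the resource figures are all hypothetical, and the status string merely borrows the name of a Mesos task state as a label.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Framework:
    """Toy framework scheduler: uses an offer if one pending task fits in it."""
    name: str
    task_cpus: float
    task_mem_mb: int
    pending_tasks: int
    launched: list = field(default_factory=list)

    def on_offer(self, offer):
        # Step 3: the framework uses or rejects the offered resources.
        fits = offer.cpus >= self.task_cpus and offer.mem_mb >= self.task_mem_mb
        if self.pending_tasks > 0 and fits:
            self.pending_tasks -= 1
            self.launched.append(offer.agent)
            return True
        return False


class Master:
    def __init__(self):
        self.free = {}     # agent name -> (free cpus, free memory in MB)
        self.status = []   # (agent, framework, state) reports

    def advertise(self, agent, cpus, mem_mb):
        # Step 1: an agent advertises its resources to the master.
        self.free[agent] = (cpus, mem_mb)

    def offer_round(self, frameworks):
        # Step 2: the master offers each agent's free resources to the frameworks.
        for agent in list(self.free):
            for framework in frameworks:
                cpus, mem_mb = self.free[agent]
                if framework.on_offer(Offer(agent, cpus, mem_mb)):
                    self.free[agent] = (cpus - framework.task_cpus,
                                        mem_mb - framework.task_mem_mb)
                    # Step 4: the agent reports task status back to the master.
                    self.status.append((agent, framework.name, "TASK_RUNNING"))


master = Master()
master.advertise("agent-1", 4.0, 8192)
master.advertise("agent-2", 2.0, 4096)
cassandra = Framework("cassandra", task_cpus=2.0, task_mem_mb=4096, pending_tasks=2)
spark = Framework("spark", task_cpus=1.0, task_mem_mb=1024, pending_tasks=3)
master.offer_round([cassandra, spark])
print(cassandra.launched, spark.launched, master.free)
```

The design point the sketch tries to capture is that the master only decides who gets offered what, while each framework decides for itself whether an offer is usable.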

DC/OS ENABLES MODERN DISTRIBUTED APPS

[Diagram: layered stack. Modern App Components (Big Data + Analytics Engines; Microservices (in containers)) with the component labels Streaming, Batch, Machine Learning, Analytics, Functions & Logic, Search, Time Series, SQL / NoSQL, Databases; running on the Datacenter Operating System (DC/OS), which builds on the Distributed Systems Kernel (Mesos), on top of Any Infrastructure (Physical, Virtual, Cloud)]


DATA PROCESSING AT HYPERSCALE

EVENTS: Sensors, Devices, Clients

INGEST: Apache Kafka

STORE: Apache Cassandra

ANALYZE: Apache Spark

ACT: Akka (applications)

All running on DC/OS

MESSAGE QUEUES

• Apache Kafka
• ØMQ, RabbitMQ, Disque (Redis-based), etc.
• fluentd, Logstash, Flume
• Akka streams
• cloud-only: AWS SQS, Google Cloud Pub/Sub

APACHE KAFKA

• High-throughput, distributed, persistent publish-subscribe messaging system
• Originates from LinkedIn
• Typically used as buffer/de-coupling layer in online stream processing
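As a rough illustration of "persistent publish-subscribe" and of the buffer/de-coupling role, the sketch below models a topic as an append-only log with one read offset per consumer group. It is an in-memory toy, not the Kafka client API: `ToyLog`, `publish`, and `poll` are hypothetical names.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a persistent, append-only topic log."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> list of messages
        self._offsets = defaultdict(int)   # (group, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message and return its offset in the topic."""
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1

    def poll(self, group, topic, max_messages=10):
        """Return the next messages for this group and advance only its own offset."""
        start = self._offsets[(group, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(group, topic)] = start + len(batch)
        return batch


log = ToyLog()
log.publish("events", "a")
log.publish("events", "b")
print(log.poll("analytics", "events"))                 # fast consumer reads everything
print(log.poll("billing", "events", max_messages=1))   # slow consumer lags behind
```

Because messages stay in the log after being read, a slow consumer group can catch up later without the producer or the other groups having to wait for it.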

fluentd


HOW TO CHOOSE?

● Scalability
● Message Type
● Log vs …
● Delivery Guarantees/Message durability
● Routing Capabilities
● Failover
● Community
● Mesos Support ;-)

DELIVERY GUARANTEES

At most once: Messages may be lost but are never redelivered.

At least once: Messages are never lost but may be redelivered.

Exactly once: This is what people actually want; each message is delivered once and only once.
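A small simulation can show where the first two guarantees come from. The sketch assumes a single consumer that crashes once and restarts from its last committed offset; the only difference between the two modes is whether the offset is committed before or after processing. The function names and the crash model are hypothetical. The `deduplicate` helper illustrates one common way to approximate exactly-once behaviour, namely at-least-once delivery combined with an idempotent consumer.

```python
def consume(messages, commit_first, crash_at):
    """Process messages with one simulated crash, restarting from the committed offset.

    commit_first=True  -> offset committed before processing (at most once)
    commit_first=False -> offset committed after processing (at least once)
    crash_at is the index of the message being handled when the consumer dies.
    """
    processed = []
    committed = 0      # next offset to read after a restart
    crashed = False
    position = 0
    while position < len(messages):
        message = messages[position]
        if commit_first:
            committed = position + 1
            if position == crash_at and not crashed:
                crashed = True
                position = committed   # restart: this message is skipped (lost)
                continue
            processed.append(message)
        else:
            processed.append(message)
            if position == crash_at and not crashed:
                crashed = True
                position = committed   # restart: this message is read again
                continue
            committed = position + 1
        position += 1
    return processed


def deduplicate(processed):
    """Idempotent sink: keep only the first occurrence of each message id."""
    seen = set()
    result = []
    for message in processed:
        if message not in seen:
            seen.add(message)
            result.append(message)
    return result


stream = ["m0", "m1", "m2"]
print(consume(stream, commit_first=True, crash_at=1))    # m1 is lost
print(consume(stream, commit_first=False, crash_at=1))   # m1 is seen twice
print(deduplicate(consume(stream, commit_first=False, crash_at=1)))
```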

Murphy’s Law of Distributed Systems:

Anything that can go wrong, will go wrong … partially!

Routing

Simple Pipes vs. Routing



STREAM PROCESSING

• Apache Storm
• Apache Spark
• Apache Samza
• Apache Flink
• Apache Apex
• Concord
• cloud-only: AWS Kinesis, Google Cloud Dataflow


APACHE SPARK

APACHE SPARK (STREAMING)

Typical Use: distributed, large-scale data processing; micro-batching

Why Spark Streaming?

• Micro-batching gives low latency compared with classic batch processing

• Well-defined role means it fits in well with other pieces of the pipeline
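The idea of micro-batching can be sketched without Spark: incoming events are grouped into small fixed-length time windows, and each window is handled as a tiny batch job. The code below is plain Python, not the Spark Streaming API; the function names, the event format `(timestamp_s, value)`, and the choice of a sum as the "job" are assumptions made for illustration.

```python
from collections import defaultdict


def micro_batches(events, interval_s):
    """Group (timestamp_s, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp_s, value in events:
        batches[int(timestamp_s // interval_s)].append(value)
    return [batches[key] for key in sorted(batches)]


def process_stream(events, interval_s):
    """Run one small batch job (here: a sum) per micro-batch."""
    return [sum(batch) for batch in micro_batches(events, interval_s)]


events = [(0.1, 1), (0.9, 2), (1.2, 3), (2.5, 4), (2.7, 5)]
print(micro_batches(events, 1.0))
print(process_stream(events, 1.0))
```

In this model a result for an event becomes available only once its batch closes, so the batch interval is a lower bound on worst-case latency; shrinking the interval trades throughput per job for faster results.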


HOW TO CHOOSE?

● Execution Model
● Native Streaming vs Microbatch
● Fault Tolerance Granularity
● Per record, per batch
● Delivery Guarantees
● API
● SQL
● Spark
● Performance…
● Realtime ≠ Realtime
● Community
● Mesos Support ;-)

EXECUTION MODEL

● Micro-Batching
● Native Streaming

FAULT TOLERANCE

● Checkpoint per “Batch”
● Ack-Per-Record
● Checkpoint per Batch

DELIVERY GUARANTEES

● “Exactly once”
● At least Once



Datastores

Data Model

Relational

● Schema
● SQL
● Foreign Keys/Joins
● OLTP/OLAP

Key-Value

● Simple
● Scalable
● Cache

Graph

● Complex relations
● Social Graph
● Recommendation
● Fraud detection

Document

● Schema-Less
● Semi-structured queries
● Product catalogue
● Session data

Files

Time-Series

Demo Time

Generator Display

1. Financial data created by generator

2. Written to Kafka topics

3. Kafka Topics consumed by Flink

4. Flink pipeline operates on Kafka data

5. Results written back into Kafka stream (another topic)

6. Results displayed
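The six demo steps can be mimicked end to end in a few lines of plain Python. This is only a stand-in for the real demo: the deques replace Kafka topics, the filter replaces the Flink pipeline, and the event fields, topic names, and the "large transaction" threshold are hypothetical, since the slides do not say what the Flink job computes.

```python
import random
from collections import deque


def generate_transactions(count, seed=0):
    """Step 1: hypothetical generator of financial events."""
    rng = random.Random(seed)
    accounts = ["acct-a", "acct-b", "acct-c"]
    return [{"account": rng.choice(accounts), "amount": rng.randint(1, 100)}
            for _ in range(count)]


def run_pipeline(transactions, threshold=50):
    """Steps 2-6 with in-memory queues standing in for Kafka topics."""
    # A deque is only a stand-in; a real Kafka topic keeps messages after they are read.
    topics = {"transactions": deque(), "results": deque()}

    # Step 2: write the generated events to the input topic.
    topics["transactions"].extend(transactions)

    # Steps 3-5: consume the input topic, apply the (assumed) processing logic,
    # and write the results to another topic.
    while topics["transactions"]:
        event = topics["transactions"].popleft()
        if event["amount"] >= threshold:
            topics["results"].append(event)

    # Step 6: format the results for display.
    return [f'{event["account"]}: {event["amount"]}' for event in topics["results"]]


for line in run_pipeline(generate_transactions(10)):
    print(line)
```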


Keep it running!

SERVICE OPERATIONS

● Configuration Updates (ex: Scaling, re-configuration)
● Binary Upgrades
● Cluster Maintenance (ex: Backup, Restore, Restart)
● Monitor progress of operations
● Debug any runtime blockages

See you at CODEMOTION MILAN, November 10/11th, 2017

http://milan2017.codemotionworld.com/

