webinar - big data: let's smack - jorg schad

41
Big Data: Let's SMACK @joerg_schad

Upload: codemotion

Post on 21-Jan-2018

638 views

Category:

Technology


2 download

TRANSCRIPT

Page 1: Webinar - Big Data: Let's SMACK - Jorg Schad

Big Data: Let's SMACK @joerg_schad

Page 2: Webinar - Big Data: Let's SMACK - Jorg Schad

You can come and see me at Codemotion Milan 2017!

Talk: NO ONE PUTS Java IN THE

CONTAINER November 10th, 15:10-15.50

http://milan2017.codemotionworld.com/

Page 3: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 3

Jörg SchadSoftware Engineer @Mesosphere

@joerg_schad

@joerg.mesosphere

Page 4: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 4

MapReduce is crunching Data

Ancient Times...

Page 5: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 5

But then business demanded FAST DATA

We need to turn faster!

Today...

Page 6: Webinar - Big Data: Let's SMACK - Jorg Schad

Batch Event ProcessingMicro-Batch

Days Hours Minutes Seconds Microseconds

Solves problems using predictive and prescriptive analyticsReports what has happened using descriptive analytics

Predictive User InterfaceReal-time Pricing and Routing Real-time AdvertisingBilling, Chargeback Product recommendations

(Fast) Data Processing

Page 7: Webinar - Big Data: Let's SMACK - Jorg Schad

Fast Data Pipeline

EVENTS

Ubiquitous data streams from connected devices

INGEST STOREANALYZE ACT

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

Page 8: Webinar - Big Data: Let's SMACK - Jorg Schad

The SMACK Stack

EVENTS

Ubiquitous data streams from connected devices

INGEST

Apache Kafka

STORE

Apache Spark

ANALYZE

Apache Cassandra

ACT

Akka

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

Apache Mesos

Sensors

Devices

Clients

Page 9: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 9

Datacenter

Page 10: Webinar - Big Data: Let's SMACK - Jorg Schad

NAIVE APPROACH

Typical Datacentersiloed, over-provisioned servers,

low utilization

Industry Average 12-15% utilization

mySQL

microservice

Cassandra

Spark/Hadoop

Kafka

Page 11: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 11

Page 12: Webinar - Big Data: Let's SMACK - Jorg Schad

MULTIPLEXING OF DATA, SERVICES, USERS, ENVIRONMENTS

Typical Datacentersiloed, over-provisioned servers,

low utilization

Mesos/ DC/OSautomated schedulers, workload multiplexing onto the

same machines

mySQL

microservice

Cassandra

Spark/Hadoop

Kafka

Page 13: Webinar - Big Data: Let's SMACK - Jorg Schad

Apache Mesos• A top-level Apache project• A cluster resource negotiator• Scalable to 10,000s of nodes• Fault-tolerant, battle-tested• An SDK for distributed apps• Native Docker support

Page 14: Webinar - Big Data: Let's SMACK - Jorg Schad

MESOS: FUNDAMENTAL ARCHITECTURE

Mesos Master

Mesos Master

Mesos Master

Mesos AgentMesos Agent ServiceCassandra Executor

Cassandra Task

Cassandra Scheduler

Container Scheduler

Spark Scheduler

Spark Executor

Spark Task

Mesos AgentMesos Agent ServiceDocker

Executor

Docker Task

Spark Executor

Spark Task

Two-level Scheduling1. Agents advertise resources to Master2. Master offers resources to Framework3. Framework rejects / uses resources4. Agent reports task status to Master

Page 15: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 15

Page 16: Webinar - Big Data: Let's SMACK - Jorg Schad

Datacenter Operating System (DC/OS)

Distributed Systems Kernel (Mesos)

DC/OS ENABLES MODERN DISTRIBUTED APPS

Big Data + Analytics EnginesMicroservices (in containers)

Streaming

Batch

Machine Learning

Analytics

Functions & Logic

Search

Time Series

SQL / NoSQL

Databases

Modern App Components

Any Infrastructure (Physical, Virtual, Cloud)

Page 17: Webinar - Big Data: Let's SMACK - Jorg Schad

The SMACK Stack

EVENTS

Ubiquitous data streams from connected devices

INGEST

Apache Kafka

STORE

Apache Spark

ANALYZE

Apache Cassandra

ACT

Akka

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

Apache Mesos

Sensors

Devices

Clients

Page 18: Webinar - Big Data: Let's SMACK - Jorg Schad

The SMACK Stack

EVENTS

Ubiquitous data streams from connected devices

INGEST STOREANALYZE ACT

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

Page 19: Webinar - Big Data: Let's SMACK - Jorg Schad

The SMACK Stack

EVENTS

Ubiquitous data streams from connected devices

INGEST STOREANALYZE ACT

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

Page 20: Webinar - Big Data: Let's SMACK - Jorg Schad

DATA PROCESSING AT HYPERSCALE

EVENTS

Ubiquitous data streams from connected devices

INGEST

Apache Kafka

STORE

Apache Spark

ANALYZE

Apache Cassandra

ACT

Akka

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

DC/OS

Sensors

Devices

Clients

Page 21: Webinar - Big Data: Let's SMACK - Jorg Schad

MESSAGE QUEUES

Apache Kafka ØMQ, RabbitMQ, Disque (Redis-based), etc. fluentd, Logstash, Flume Akka streams cloud-only:

AWS SQS Google Cloud Pub/Sub

Page 22: Webinar - Big Data: Let's SMACK - Jorg Schad

APACHE KAFKA

High-throughput, distributed, persistent publish-subscribe messaging system

Originates from LinkedIn Typically used as buffer/de-coupling

layer in online stream processing

Page 23: Webinar - Big Data: Let's SMACK - Jorg Schad

fluentd

Page 24: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 24

● Scalability● Message Type● Log vs …

● Delivery Guarantees/Message durability

● Routing Capabilities● Failover● Community● Mesos Support ;-)

HOW TO CHOOSE?

Page 25: Webinar - Big Data: Let's SMACK - Jorg Schad

DELIVERY GUARANTEES

At most once—Messages may be lost but are never redelivered.

At least once—Messages are never lost but may be redelivered.

Exactly once—this is what people actually want, each message is delivered once and only once.

Murphy’s Law of Distributed Systems:

Anything that can go wrong, will go wrong … partially!

Page 26: Webinar - Big Data: Let's SMACK - Jorg Schad

RoutingSimple Pipes Routing

Page 27: Webinar - Big Data: Let's SMACK - Jorg Schad

DATA PROCESSING AT HYPERSCALE

EVENTS

Ubiquitous data streams from connected devices

INGEST

Apache Kafka

STORE

Apache Spark

ANALYZE

Apache Cassandra

ACT

Akka

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

DC/OS

Sensors

Devices

Clients

Page 28: Webinar - Big Data: Let's SMACK - Jorg Schad

STREAM PROCESSING• Apache Storm • Apache Spark • Apache Samza • Apache Flink • Apache Apex • Concord • cloud-only: AWS Kinesis,

Google Cloud Dataflow

Page 29: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 29

APACHE SPARK

Page 30: Webinar - Big Data: Let's SMACK - Jorg Schad

APACHE SPARK (STREAMING)

Typical Use: distributed, large-scale data processing; micro-batchingWhy Spark Streaming?

• Micro-batching creates very low latency, which can be faster

• Well defined role means it fits in well with other pieces of the pipeline

Page 31: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2016 Mesosphere, Inc. All Rights Reserved. 31

● Execution Model● Native Streaming vs Microbatch

● Fault Tolerance Granularity● Per record, per batch

● Delivery Guarantees● API

● SQL● Spark

● Performance….● Realtime ≠ Realtime

● Community● Mesos Support ;-)

HOW TO CHOOSE?

Page 32: Webinar - Big Data: Let's SMACK - Jorg Schad

EXECUTION MODELMicro-Batching Native

Streaming

Page 33: Webinar - Big Data: Let's SMACK - Jorg Schad

FAULT TOLERANCECheckpoint per “Batch”Ack-Per-Record Checkpoint per Batch

Page 34: Webinar - Big Data: Let's SMACK - Jorg Schad

DELIVERY GUARANTEES“Exactly once”At least Once

Page 35: Webinar - Big Data: Let's SMACK - Jorg Schad

DATA PROCESSING AT HYPERSCALE

EVENTS

Ubiquitous data streams from connected devices

INGEST

Apache Kafka

STORE

Apache Spark

ANALYZE

Apache Cassandra

ACT

Akka

Ingest millions of events per second

Distributed & highly scalable database

Real-time and batch process data

Visualize data and build data driven

applications

DC/OS

Sensors

Devices

Clients

Page 36: Webinar - Big Data: Let's SMACK - Jorg Schad

Datastores

Page 37: Webinar - Big Data: Let's SMACK - Jorg Schad

Data ModelKey-Value GraphRelational Document

● Schema

● SQL

● Foreign Keys/Joins

● OLTP/OLAP

● Simple

● Scalable

● Cache

FilesTime-Series● Complex

relations

● Social Graph

● Recommendation

● Fraud detections

● Schema-Less

● Semi-structured queries

● Product catalogue

● Session data

Page 38: Webinar - Big Data: Let's SMACK - Jorg Schad

Demo Time

Generator Display

1. Financial data created by generator

2. Written to Kafka topics

3. Kafka Topics consumed by Flink

4. Flink pipeline operates on Kafka data

5. Results written back into Kafka stream (another topic)

6. Results displayed

Page 39: Webinar - Big Data: Let's SMACK - Jorg Schad

© 2017 Mesosphere, Inc. All Rights Reserved. 39

Keep it running!

Page 40: Webinar - Big Data: Let's SMACK - Jorg Schad

SERVICE OPERATIONS

● Configuration Updates (ex: Scaling, re-configuration) ● Binary Upgrades ● Cluster Maintenance (ex: Backup, Restore, Restart) ● Monitor progress of operations ● Debug any runtime blockages

Page 41: Webinar - Big Data: Let's SMACK - Jorg Schad

CODEMOTION MILAN November 10/11th,2017http://milan2017.codemotionworld.com/

See you at