Post on 09-Sep-2018

  • BIG CONTAINERS, BIG ORCHESTRATION, BIG DATA

    William Benton, Red Hat

  • COMMONS GATHERING | Seattle | November 7 | #OCGathering2016

    BACKGROUND

  • WHAT OUR CLUSTER LOOKED LIKE IN 2014

    [Diagram: Mesos managing six Spark executors over a networked POSIX FS; numeric labels 1 through 4 appear across the executors.]

  • Analytics is no longer a separate workload. Analytics is an essential component of modern data-driven applications.

  • OUR GOALS

    git

  • FORECAST

    Spark and microservices

    Architectures for analytics and applications

    Scheduling and storage

    Future work (and how to get involved)

  • SPARK AND MICROSERVICES

  • Apache Spark is a fast and general framework for distributed data processing.

  • Resilient Distributed Datasets are partitioned, lazy, and immutable homogeneous collections.

  • RESILIENT DISTRIBUTED DATASETS

    [Diagram: the collection 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 split into four partitions: (1 2 3 4), (5 6 7 8), (9 10 11 12), (13 14 15 16).]
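    The partitioning shown on this slide can be sketched in plain Python. This is only an illustration, not Spark's partitioner: the `partition` helper and its contiguous, near-equal slicing are assumptions made for the example, and tuples stand in for the immutability of an RDD.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition(items: Sequence[T], num_partitions: int) -> Tuple[Tuple[T, ...], ...]:
    """Split items into num_partitions contiguous, near-equal, immutable slices.

    Hypothetical helper for illustration; Spark chooses partitions differently.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    size, extra = divmod(len(items), num_partitions)
    parts = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(tuple(items[start:end]))
        start = end
    return tuple(parts)


# Sixteen elements in four partitions, as on the slide.
partitions = partition(range(1, 17), 4)
```

    Each inner tuple plays the role of one partition that a separate executor could process independently.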

  • [Diagram: the input 1 2 3 flows through FILTER (x: x % 2 != 0), MAP (x: x * 3), and FLATMAP (x: [x, x+1]).]

  • [Diagram: the pipeline FILTER (x: x % 2 != 0), MAP (x: x * 3), FLATMAP (x: [x, x+1]) ends with COLLECT, which returns 3 4 9 10.]

  • [Diagram: the input 1 2 3 flows through FILTER (x: x % 2 != 0), MAP (x: x * 3), and FLATMAP (x: [x, x+1]); CACHE and SAVE AS TEXT FILE are applied to the result 3 4 9 10.]
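    The filter, map, and flatmap pipeline from these slides can be imitated with a small lazy collection in plain Python. This is a toy stand-in rather than Spark itself: `LazyCollection` and `flat_map` are names invented for the sketch (the corresponding PySpark RDD methods are `filter`, `map`, `flatMap`, and `collect`). It shows the two properties the talk stresses: transformations only record work, and nothing runs until an action such as `collect` is called.

```python
from itertools import chain


class LazyCollection:
    """Toy stand-in for an RDD: records transformations, runs them on collect()."""

    def __init__(self, source, ops=()):
        self._source = tuple(source)  # immutable snapshot of the input
        self._ops = tuple(ops)        # recorded, not yet executed

    def _extend(self, op):
        # Every transformation returns a new object; the original is unchanged.
        return LazyCollection(self._source, self._ops + (op,))

    def filter(self, pred):
        return self._extend(lambda it: (x for x in it if pred(x)))

    def map(self, fn):
        return self._extend(lambda it: (fn(x) for x in it))

    def flat_map(self, fn):
        return self._extend(lambda it: chain.from_iterable(fn(x) for x in it))

    def collect(self):
        # The action: only here is the recorded pipeline actually evaluated.
        it = iter(self._source)
        for op in self._ops:
            it = op(it)
        return list(it)


numbers = LazyCollection([1, 2, 3])
pipeline = (
    numbers.filter(lambda x: x % 2 != 0)   # keeps 1 and 3
    .map(lambda x: x * 3)                  # 3 and 9
    .flat_map(lambda x: [x, x + 1])        # 3, 4, 9, 10
)
result = pipeline.collect()
```

    Here `result` is `[3, 4, 9, 10]`, matching the COLLECT slide, and `numbers` still collects to `[1, 2, 3]`.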

  • [Diagram: a driver and a cluster manager coordinate executors 1 through n. Executor 1 holds 1 2 3 and executor n holds 10 11 12; each applies x: x * 2 to its own partition, yielding 2 4 6 and 20 22 24, and each has a cache.]

  • [Diagram: the Spark stack. Graph, SQL, ML, and Streaming libraries sit on Spark core, which runs on ad hoc, Mesos, or YARN cluster managers.]

  • [Diagram: the Spark stack with k8s added as a cluster manager alongside ad hoc, Mesos, and YARN.]

  • A microservice architecture employs lightweight, modular, and typically stateless components with well-defined interfaces and contracts.

  • BENEFITS OF MICROSERVICE ARCHITECTURES

    2 + 2 5

  • MICROSERVICES AND SPARK

    [Diagram: a master and four executors holding the partitions (1 2 3), (4 5 6), (7 8 9), and (10 11 12). Each executor applies x: x * 2 to its own partition, producing 2 4 6 8 10 12 14 16 18 20 22 24.]
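    The idea that each executor applies the same function to its own partition can be sketched with a thread pool. This is an analogy only: `run_on_executors` is a hypothetical helper, threads stand in for executor processes, and no scheduling, caching, or fault tolerance is modeled.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_executors(partitions, fn):
    """Apply fn to every element, using one worker thread per partition."""

    def task(part):
        return [fn(x) for x in part]

    workers = max(1, len(partitions))  # ThreadPoolExecutor rejects zero workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(task, partitions))


# The four partitions from the slide, each doubled independently.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
doubled = run_on_executors(partitions, lambda x: x * 2)
flat = [x for part in doubled for x in part]
```

    Because no partition depends on another, the workers need no coordination beyond receiving the function, which is what makes executors resemble stateless services.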

  • ARCHITECTURES FOR ANALYTICS AND APPLICATIONS

  • APPLICATION RESPONSIBILITIES

    [Diagram with three groups of labels.]

    Data sources: events; databases; file, object storage

    Processing: transform (three stages), aggregate, train models, archive

    Consumers: management, web and mobile, reporting, developer UI

  • LEGACY ARCHITECTURES

  • CONVENTIONAL DATA WAREHOUSE

    [Diagram: events feed transaction processing (UI, business logic, RDBMS); a transform step loads a second RDBMS used for analytic processing (analysis, interactive query, reporting).]

  • HADOOP-STYLE DATA LAKE

    [Diagram: events land in a cluster of five HDFS nodes; in the second build, each node also runs compute alongside its HDFS storage.]

  • MODERN ARCHITECTURES

  • THE LAMBDA ARCHITECTURE

    [Diagram: events feed both a batch layer (transform and precise analysis over a DFS) and a speed layer (transform and imprecise analysis); a serving layer federates the two for the UI.]
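    A minimal sketch of the three layers follows, using event counts as the analysis. Everything here is an assumption made for illustration: the `LambdaCounts` class, its method names, and the choice of counting. For simplicity the speed view below is exact, whereas the slide describes the speed layer as imprecise; real speed layers often trade accuracy for latency.

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda architecture that counts events per key."""

    def __init__(self):
        self.master_log = []         # append-only archive (the "DFS")
        self.batch_view = Counter()  # recomputed from the full log
        self.speed_view = Counter()  # incremental, covers events since the last batch run

    def ingest(self, key):
        # Every event goes to both the archive and the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Batch layer: recompute from scratch, then reset the speed layer.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        # Serving layer: federate the batch and speed views.
        return self.batch_view[key] + self.speed_view[key]


counts = LambdaCounts()
for event in ["login", "login", "purchase"]:
    counts.ingest(event)
counts.run_batch()
counts.ingest("login")  # arrives after the batch run; only the speed view sees it
total_logins = counts.query("login")
```

    The query combines two logins from the batch view with one from the speed view, so the caller never needs to know which layer held which events.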

  • THE KAPPA ARCHITECTURE

    [Diagram: events enter a queue topic for raw data; a transform stage publishes to a queue topic for preprocessed data; an analysis stage publishes to a queue topic for analysis results, which reporting and the end-user UI consume.]
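    The chain of topics can be sketched with append-only lists and consumer offsets. The `Topic` class, `run_stage`, and the sample records are all hypothetical; a real deployment would use a message broker rather than in-process lists. The sketch is meant to show that each stage reads one topic and writes the next, and that records stay in the log so a stage can be replayed from any offset.

```python
class Topic:
    """Append-only log; consumers track their own read offsets."""

    def __init__(self):
        self.records = []

    def publish(self, record):
        self.records.append(record)

    def read_from(self, offset):
        return self.records[offset:]


def run_stage(source, sink, fn, offset=0):
    """Process records from source starting at offset; return the new offset."""
    new_records = source.read_from(offset)
    for record in new_records:
        sink.publish(fn(record))
    return offset + len(new_records)


raw, preprocessed, results = Topic(), Topic(), Topic()

for event in ["  42 ", "7", " 13"]:
    raw.publish(event)

# Transform stage: raw data topic -> preprocessed data topic.
transform_offset = run_stage(raw, preprocessed, lambda s: int(s.strip()))
# Analysis stage: preprocessed data topic -> analysis results topic.
analysis_offset = run_stage(
    preprocessed, results, lambda n: {"value": n, "is_odd": n % 2 != 0}
)
report = results.read_from(0)
```

    Since `raw` still holds every record, a changed transform could be rerun from offset 0 into a fresh topic, which is the replay property the Kappa approach relies on.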

  • DATA FEDERATION IN THE COMPUTE LAYER

    [Diagram with three groups of labels.]

    Data sources: events; databases; file, object storage

    Processing: transform (three stages), aggregate, train models, archive

    Consumers: management, web and mobile, reporting, developer UI

  • PRACTICALITIES AND POTENTIAL PITFALLS

  • SIDEBAR: THE MONOLITHIC SPARK ANTIPATTERN

    [Diagram: a cluster scheduler and resource manager place apps 1 through 4 on six shared Spark executors backed by a shared FS.]

  • ONE CLUSTER PER APPLICATION

    [Diagram: on OpenShift, apps 1 through 6 each appear twice, backed by shared object stores and databases.]

  • [Diagram: on OpenShift, apps 1 through 6 each appear twice, backed by a POSIX FS and six HDFS instances.]

  • [Diagram: on OpenShift, apps 1 through 6 each appear twice, backed by an object store.]

    interoperability, fine-grained access control, many implementations

    consistency model, performance

  • For the workloads from Facebook and Bing, we see that 96% and 89% of the active jobs respectively can have their data entirely fit in memory, given an allowance of 32GB memory per server for caching.

    PACMan: Coordinated Memory Caching for Parallel Jobs. G. Ananthanarayanan et al., in Proceedings of NSDI '12.

  • Recent studies have shown that reading data from local disks is only about 8% faster than reading it from remote disks over the network [and] this 8% number is decreasing.

    Tom Phelan, The Elephant in the Big Data Room: Data Locality is Irrelevant for Hadoop (http://goo.gl/MnCKuM)

  • Three out of ten hours of job runtime were spent moving files from the staging directory to the final directory in HDFS. We were essentially compressing, serializing, and replicating three copies for a single read.

    Apache Spark @Scale: a 60+ TB production use case, Facebook Engineering blog post

  • [Diagram: a driver with executors 1 through n, each holding a cache.]

  • COLOCATED COMPUTE AND STORAGE: YAGNI

    Disk locality is just another kind of caching, but memory is much faster than disk and working set sizes typically fit in cluster memory after ETL.

    The I/O-heavy behavior of frameworks designed for colocated compute and storage performs worse than iterative processing in memory.

    Colocating compute and storage prevents independent scale-out of compute and turns cattle into pets.

  • BUT IF YOU DO

    [Diagram: on OpenShift, apps 1 through 3 each appear twice, first over a single storage block, then, in the second build, over three separate storage blocks.]

  • PLAYING ALONG AT HOME

  • TRY IT OUT YOURSELF

    Enabling Spark on OpenShift: https://github.com/radanalyticsio

    Video demo: https://vimeo.com/189710503

    Meet the teams at lunch!

  • THANKS!

    @willb, https://chapeau.freevariable.com
