Streaming Data Pipelines — icl.utk.edu/jlesc9/files/pta3.1/jlesc9_matri.pdf


Streaming Data Pipelines

Pierre Matri, Philip Carns, Robert Latham, Shane Snyder, and Robert Ross
Argonne National Laboratory

Gabriel Antoniu and Alexandru Costan
INRIA

Sam Gutierrez, Bob Robey, Brad Settlemyer, and Galen Shipman
Los Alamos National Laboratory

Jerome Soumagne and Neil Fortner
The HDF Group

George Amvrosiadis, Chuck Cranor, Greg Ganger, Ankush Jain, and Qing Zheng
Carnegie Mellon University

Previously: Týr blob storage system

[Figure: BDA applications (frameworks: Hadoop M/R, Spark, Flink) and HPC applications (framework: MPI) running on separate storage stacks: KV logs plus a DFS on the BDA side, a DFS on the HPC side.]

[Figure: the same BDA and HPC application stacks running on a single unified DFS carrying the KV logs, labeled "Týr: Converging Storage Layer".]

Pure-HPC use-cases

- Buffering between source & processing: in-situ visualization, computational steering
- Log-formatted data storage: time series, streaming data (sensor events)
- Checkpointing, recovery: similar to cloud use-cases

Convergence use-case

- Cross-platform application portability: how do we ensure cross-platform portability when some basic structures are not available?
- Cross-platform research: many cloud algorithms use distributed logs; can we leverage those on HPC? Ex: failure detection

Distributed logging on HPC?

Use-Case: LCLS-II

The Linac Coherent Light Source @ Stanford: the world's first hard X-ray free-electron laser

LCLS-II is an upgrade of the current LCLS

[Figure: LCLS-II detectors emitting parallel streams of events into the data pipeline.]

Data Pipeline Requirements

- Scalability: needs to scale to hundreds of terabytes per second (= 170 million events per second)
- Simplicity: building blocks should be available for simple use-cases
- Variability: the event generation rate is highly variable depending on sensor data
- Reproducibility: results should be reproducible (= storage)
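The slide equates hundreds of terabytes per second with 170 million events per second. A back-of-envelope check of those two figures, assuming an average event size of roughly 1 MB (an assumption; the slides do not state the event size):

```python
# Sanity-check of the throughput figures on the slide.
# Assumption (not from the slides): average event size of ~1 MB.
events_per_second = 170_000_000
avg_event_size_bytes = 1_000_000  # ~1 MB per event, hypothetical

throughput_bytes = events_per_second * avg_event_size_bytes
throughput_tb = throughput_bytes / 1e12
print(f"{throughput_tb:.0f} TB/s")  # 170 TB/s at 1 MB/event
```

At ~1 MB per event the two numbers line up at 170 TB/s, i.e., squarely in the "hundreds of terabytes per second" range.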

Data pipeline model

- A pipeline is composed of a sequence of steps, performing actions on the events
- Each step is a process / microservice; a step exposes an API over RPC, with a single endpoint
- Events can be augmented with tags (e.g., topics)

Using Thallium for RPC [Mercury + Argobots]

High-level Steps

- MapStep(Event) -> Event: transforms an event
- FilterStep(Event) -> bool: drops events not matching the predicate
- TagStep(key, Event) -> value: sets a tag on an event
- TimedBatchStep(msecs, (set<Event>) -> Event): time-based event aggregation
- CountBatchStep(count, (set<Event>) -> Event): count-based event aggregation
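The five step types above can be sketched as simple local objects. This is an illustrative Python model only: the actual system implements each step as a microservice exposing a single Thallium RPC endpoint, and the `apply`/`run` names and dict-based events here are assumptions for the sketch.

```python
import time

# Illustrative sketches of the five high-level step types from the slide.
# Events are modeled as plain dicts; a step returning None means the
# event was dropped (filter) or buffered (batching).

class MapStep:
    """MapStep(Event) -> Event: transforms an event."""
    def __init__(self, fn):
        self.fn = fn
    def apply(self, event):
        return self.fn(event)

class FilterStep:
    """FilterStep(Event) -> bool: drops events not matching the predicate."""
    def __init__(self, predicate):
        self.predicate = predicate
    def apply(self, event):
        return event if self.predicate(event) else None

class TagStep:
    """TagStep(key, Event) -> value: sets a tag (e.g., a topic) on an event."""
    def __init__(self, key, fn):
        self.key, self.fn = key, fn
    def apply(self, event):
        event.setdefault("tags", {})[self.key] = self.fn(event)
        return event

class TimedBatchStep:
    """TimedBatchStep(msecs, (set<Event>) -> Event): time-based aggregation."""
    def __init__(self, msecs, fn):
        self.interval = msecs / 1000.0
        self.fn, self.buffer = fn, []
        self.deadline = time.monotonic() + self.interval
    def apply(self, event):
        self.buffer.append(event)
        if time.monotonic() >= self.deadline:
            batch, self.buffer = self.buffer, []
            self.deadline = time.monotonic() + self.interval
            return self.fn(batch)
        return None

class CountBatchStep:
    """CountBatchStep(count, (set<Event>) -> Event): count-based aggregation."""
    def __init__(self, count, fn):
        self.count, self.fn, self.buffer = count, fn, []
    def apply(self, event):
        self.buffer.append(event)
        if len(self.buffer) == self.count:
            batch, self.buffer = self.buffer, []
            return self.fn(batch)
        return None

def run(pipeline, events):
    """Push each event through the sequence of steps in order."""
    out = []
    for event in events:
        for step in pipeline:
            event = step.apply(event)
            if event is None:
                break
        else:
            out.append(event)
    return out

# A two-step pipeline: double each value, then keep only values above 2.
doubled = run([MapStep(lambda e: {**e, "value": e["value"] * 2}),
               FilterStep(lambda e: e["value"] > 2)],
              [{"value": 1}, {"value": 2}, {"value": 3}])
# doubled == [{"value": 4}, {"value": 6}]
```

Returning `None` from a step to signal "nothing to emit yet" is what lets the batching steps aggregate many input events into one output event.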

Decoupling with storage

Successive steps should allow decoupling, e.g., for event bursts, buffering, persistence, offline processing, …

[Figure: an ingress step and an egress step decoupled by a MetaStep backed by storage (blobs, FS, …).]

High-level Storage Steps

- MemoryStorageMetaStep()
- BlobStorageMetaStep(host, blob_key)
- FSStorageMetaStep(path)

Preliminary storage evaluation

Experiments on the Theta supercomputer, with:

- Up to 100,000 event generators (1 per core)
- A simple pipeline, composed of a single storage step
- 8,192 parallel pipelines, with round-robin routing
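The round-robin routing used in the evaluation can be sketched as a cyclic iterator over pipeline indices; the slides do not show the routing code, so this is a minimal illustration of the policy, not the actual implementation.

```python
from itertools import cycle

def round_robin_router(num_pipelines):
    """Yield the target pipeline index for each successive event."""
    # The evaluation used 8,192 parallel pipelines; a small count is
    # used below to keep the example readable.
    return cycle(range(num_pipelines))

router = round_robin_router(4)
assignments = [next(router) for _ in range(6)]
# assignments == [0, 1, 2, 3, 0, 1]
```

Round-robin spreads load evenly regardless of which generator produced an event, which suits the highly variable event rates noted in the requirements.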

This year: deployment

A data reduction pipeline is composed of potentially hundreds of steps that must be deployed alongside the application = a challenge in HPC.

How do we describe and deploy hundreds or thousands of microservices on an HPC platform?
