
Convergence of computation and data workflows

IS-ENES Workshop on Workflows and Metadata Generation

Lisbon, PORTUGAL

V. Balaji

NOAA/GFDL and Princeton University

28 September 2016

V. Balaji ([email protected]) Convergence 28 September 2016 1 / 35

Amy Langenhorst 1977-2016

Principal Developer of the FMS Runtime Environment FRE.


Outline

1 Hardware Directions
    GPUs, MICs, ARM
    Inexact computing
    Energy cost of algorithms and data movement

2 A Graph Approach
    Directed Acyclic Graphs
    Convergence of computation and data
    Fault tolerance across the workflow

3 Metadata and provenance
    Development and production workflow
    Statistical and scientific reproducibility

4 Summary


Power-8 with NVLink

Figure courtesy IBM.

KNL Overview

Figure courtesy Intel.


The inexorable triumph of commodity computing

... means ARM? From The Platform, Hemsoth (2015).

Irreproducible Computing, Inexact Hardware

Figure 1 from Düben et al, Phil. Trans. A, 2016. Which bits can we allow to be “inexactly” flipped? Lorenz 96 as canonical test case of non-linearity and chaos.
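The question of which bits may be flipped can be explored with a toy experiment. The sketch below is not the method of Düben et al.; it assumes the standard Lorenz 96 equations, a plain RK4 integrator, and a crude fault model (one chosen bit of every variable is flipped after every step). The function names, step size, forcing and system size are all illustrative choices.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE-754 double x (bit 0 = lowest mantissa bit, 63 = sign)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def l96_tendency(x, forcing=8.0):
    """Lorenz 96: dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, cyclic in i."""
    n = len(x)
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
            for i in range(n)]


def rk4_step(x, dt=0.01, forcing=8.0):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(scale, k):
        return [xi + scale * ki for xi, ki in zip(x, k)]

    k1 = l96_tendency(x, forcing)
    k2 = l96_tendency(shifted(dt / 2, k1), forcing)
    k3 = l96_tendency(shifted(dt / 2, k2), forcing)
    k4 = l96_tendency(shifted(dt, k3), forcing)
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def run(steps, faulty_bit=None, n=8):
    """Integrate from a slightly perturbed rest state, optionally with bit flips."""
    x = [8.0] * n
    x[0] += 0.01
    for _ in range(steps):
        x = rk4_step(x)
        if faulty_bit is not None:
            # Assumed fault model: the same bit is flipped in every variable.
            x = [flip_bit(xi, faulty_bit) for xi in x]
    return x


exact = run(200)
for bit in (0, 20, 40):
    noisy = run(200, faulty_bit=bit)
    print(bit, max(abs(a - b) for a, b in zip(exact, noisy)))
```

Low mantissa bits perturb the state far less than high ones, so comparing the printed errors across bit positions gives a rough feel for how much inexactness a chaotic system tolerates.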


Irreproducible Computing, Inexact Hardware

Figure 2 from Düben et al, Phil. Trans. A, 2016.


COSMO: energy to solution

[Figure: bar chart of energy to solution (kWh / ensemble member) for COSMO on Cray XE6 (Nov. 2011), Cray XK7 (Nov. 2012), Cray XC30 (Nov. 2012) and Cray XC30 hybrid (GPU) (Nov. 2013), comparing the current production code with the new HP2C funded code. Annotated factors: 1.41x, 1.49x, 1.75x, 2.51x, 2.64x, 3.93x, 6.89x. Slide from T. Schulthess, ENES HPC Workshop, Hamburg, March 17, 2014.]

Figure courtesy Thomas Schulthess, CSCS.

JPSY comparison across ESMs

Model  Machine  Resol      SYPD  CHSY   JPSY
CM4    gaea/c2  1.2×10^8   4.5   16000  8.92×10^8
CM4    gaea/c3  1.2×10^8   10    7000   3.40×10^8

Comparative measures of capability (SYPD), capacity (CHSY), and energy cost (JPSY) per “unit of science”.
Can you have codes that are “slower but greener”? Algorithms that are “less accurate but more eco-friendly”?
From Balaji et al (2016), in review at GMDD. http://goo.gl/Nj1c2N
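The columns of the table can be related with simple arithmetic. The sketch below assumes the usual readings of the acronyms (SYPD: simulated years per day, CHSY: core-hours per simulated year, JPSY: joules per simulated year) and assumes every core is busy for the whole run. The derived quantities (implied core count, energy per core-hour, watts per core) are not on the slide.

```python
# Values copied from the table; the derived numbers below are illustrative.
RUNS = {
    "gaea/c2": {"sypd": 4.5, "chsy": 16000, "jpsy": 8.92e8},
    "gaea/c3": {"sypd": 10.0, "chsy": 7000, "jpsy": 3.40e8},
}


def implied_cores(sypd, chsy):
    """Core count implied by CHSY = cores * 24 h / SYPD (assumes full occupancy)."""
    return chsy * sypd / 24.0


def joules_per_core_hour(chsy, jpsy):
    """Energy spent per core-hour consumed."""
    return jpsy / chsy


def watts_per_core(chsy, jpsy):
    """Average power per core: joules per core-hour divided by 3600 s."""
    return joules_per_core_hour(chsy, jpsy) / 3600.0


for machine, r in RUNS.items():
    print(machine,
          round(implied_cores(r["sypd"], r["chsy"])),
          round(joules_per_core_hour(r["chsy"], r["jpsy"])),
          round(watts_per_core(r["chsy"], r["jpsy"]), 1))
```

Under these assumptions the two machines use a similar number of cores, so the higher SYPD and lower JPSY on gaea/c3 reflect faster and less energy-hungry cores rather than a larger allocation.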


Workflows for the exascale

Billion-way concurrency still a daunting challenge for everyone: no magic bullets anywhere to be found.
Exotic hardware is on the way; this is quite likely the last generation of conventional hardware. Computing is likely to become irreproducible.
Software investment paid back in power savings (Schulthess). Energy to solution will become key metric.
More threading needs to be found: to fit 10^18 op/s within a 1 MW power budget, an operation should be 1 pJ: data movement is ~10 pJ to main memory; ~100 pJ on network!
DARPA: commodity improvements will slow to a trickle within 10 years: go back to specialized computing?
DOE: double investment in exascale.
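The power-budget figure follows from multiplying the operation rate by the energy per operation. A minimal check of that arithmetic, under the simplifying assumption (not stated on the slide) that in the worst case every operation pays one data movement of the given kind:

```python
PICOJOULE = 1e-12   # joules
EXA_OPS = 1e18      # operations per second


def sustained_power_watts(ops_per_second, joules_per_op):
    """Power needed to sustain a given operation rate at a given energy per op."""
    return ops_per_second * joules_per_op


compute_only = sustained_power_watts(EXA_OPS, 1 * PICOJOULE)      # about 1 MW
via_main_memory = sustained_power_watts(EXA_OPS, 10 * PICOJOULE)  # about 10 MW
via_network = sustained_power_watts(EXA_OPS, 100 * PICOJOULE)     # about 100 MW

for label, watts in [("compute only", compute_only),
                     ("operand from main memory", via_main_memory),
                     ("operand over network", via_network)]:
    print(f"{label}: {watts / 1e6:.0f} MW")
```

The two orders of magnitude between the first and last line are the reason the workflow has to be organized around where the data already sits.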


A network of compute and data nodes

FRE and other elements in the GFDL modeling environment manage the complex scheduling of jobs across a distributed computing resource.

... a global network of compute and data nodes

Workflow task is to minimize data flow across the global network.

Figure courtesy IPSL.


Examples of DAG parallelism

DAG example: Cholesky Inversion

Source: Stan Tomov, ICL, University of Tennessee, Knoxville

DAG = Directed Acyclic Graph

Can IFS use this technology?

ECMWF Seminar 2013

Figure courtesy George Mozdzynski, ECMWF.
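A minimal sketch of how a DAG exposes parallelism, using Python's standard-library graphlib (Python 3.9+). The task names follow the usual tile kernels of a tiled Cholesky factorization (POTRF, TRSM, SYRK, GEMM) on a 3×3 tile matrix; the dependency table is written by hand for illustration and is not taken from the figure.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
DEPENDENCIES = {
    "POTRF(0)": set(),
    "TRSM(1,0)": {"POTRF(0)"},
    "TRSM(2,0)": {"POTRF(0)"},
    "SYRK(1,0)": {"TRSM(1,0)"},
    "GEMM(2,1,0)": {"TRSM(1,0)", "TRSM(2,0)"},
    "SYRK(2,0)": {"TRSM(2,0)"},
    "POTRF(1)": {"SYRK(1,0)"},
    "TRSM(2,1)": {"POTRF(1)", "GEMM(2,1,0)"},
    "SYRK(2,1)": {"TRSM(2,1)", "SYRK(2,0)"},
    "POTRF(2)": {"SYRK(2,1)"},
}


def parallel_waves(dependencies):
    """Group tasks into waves; tasks in the same wave can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()      # everything whose predecessors are done
        waves.append(sorted(ready))
        sorter.done(*ready)             # pretend the whole wave has completed
    return waves


for number, wave in enumerate(parallel_waves(DEPENDENCIES), start=1):
    print(number, wave)
```

A real runtime would launch each ready task as soon as its own predecessors finish instead of waiting for a whole wave, but the waves already show that ten tasks collapse into seven sequential steps.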


SWARM for DAGs

Jeffrey et al, IEEE Micro 2016.


SWARM for DAGs: hardware implementation

Jeffrey et al, IEEE Micro 2016.


NVRAM will blur distinction between memory and filesystem

Hemsoth, 2014: http://goo.gl/3ZeOXt

Work avoidance

Work avoidance: find minimal path to complete output
make: traverse tree backwards; state is the filesystem state.
cylc/chaco: traverse tree forwards; each task formulated as a no-op if outputs exist; fred contains state including tasks in flight.
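The two traversal directions can be contrasted with a toy sketch. It is not how make, cylc or chaco are implemented: the three-task workflow and file names are hypothetical, a Python set stands in for the filesystem state, and only the existence of outputs is checked (real make also compares timestamps).

```python
# Hypothetical workflow: task name -> (input files, output file).
TASKS = {
    "run_model": ((), "history.nc"),
    "postprocess": (("history.nc",), "timeseries.nc"),
    "plot": (("timeseries.nc",), "figure.png"),
}
PRODUCER = {output: name for name, (_, output) in TASKS.items()}


def backward(target, existing, ran):
    """make-like: start at the wanted output, recurse only into missing inputs."""
    if target in existing:
        return
    name = PRODUCER[target]
    for needed in TASKS[name][0]:
        backward(needed, existing, ran)
    ran.append(name)          # "run" the task
    existing.add(target)      # its output now exists


def forward(order, existing, ran):
    """Forward pass: visit every task in order; a task is a no-op if its output exists."""
    for name in order:
        output = TASKS[name][1]
        if output in existing:
            continue
        ran.append(name)
        existing.add(output)


# Case: the intermediate file survives but the raw history was cleaned up.
ran_backward, ran_forward = [], []
backward("figure.png", {"timeseries.nc"}, ran_backward)
forward(["run_model", "postprocess", "plot"], {"timeseries.nc"}, ran_forward)
print(ran_backward)  # only the plot step is needed
print(ran_forward)   # the forward pass also regenerates the missing history
```

In this toy the backward traversal finds the minimal path because it never asks about outputs nobody needs, while the forward traversal depends on each task's own no-op test.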


Use of cross-network message queues

[Figure: message-queue architecture. Labelled components: MQ Cluster, MQ Apps, API and DB's at IPSL; IPSL users at browser, command line or desktop (json); an MQ Relay at each of TGCC, IDRIS, CINES, CNRM and XXX; msg flows between relays and cluster.]

IPSL have tested handling O(10^5) enqueues/dequeues per day.
Google reports Rabbit service of O(10^6) per second! (more than all SMS/WhatsApp/etc) https://goo.gl/GBlAAz
AMQP: active messages containing instructions as well as data.

Figure courtesy Sébastien Denvil, IPSL.
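The idea of an active message, one that carries an instruction together with its data, can be sketched without a broker. The example below uses only Python's standard-library queue and json modules as a stand-in for an AMQP service; the message layout, the instruction name and the handler table are hypothetical and are not the IPSL design.

```python
import json
import queue


def make_message(instruction, data):
    """Serialize an active message: what to do, plus the data to do it with."""
    return json.dumps({"instruction": instruction, "data": data})


def record_status(store, data):
    """Hypothetical handler: remember the latest status of a job."""
    store[data["job"]] = data["status"]


HANDLERS = {"record_status": record_status}


def drain(q, store):
    """Dequeue every pending message and dispatch it on its instruction field."""
    handled = 0
    while True:
        try:
            raw = q.get_nowait()
        except queue.Empty:
            return handled
        message = json.loads(raw)
        HANDLERS[message["instruction"]](store, message["data"])
        handled += 1


relay = queue.Queue()  # stands in for a site relay feeding the central cluster
relay.put(make_message("record_status", {"job": "run42", "status": "running"}))
relay.put(make_message("record_status", {"job": "run42", "status": "done"}))

job_status = {}
print(drain(relay, job_status), job_status)
```

Because the consumer only needs the handler table, new kinds of monitoring events can be added at the sending sites without changing the transport.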


Development and production workflow

Model developers have different workflow priorities and requirements.

Production workflow benefits from coherence and similarity across runs.
Development workflow requires extremely fine-grained access to code, namelists, scripts. A lot of rules broken:
    Favored IDE/UI is called vi!
    source code edits in user directories
    input file modifications on the fly
Analysis workflow requires random access to local disk: inspiration-driven rather than industrial strength
Still benefit from regression testing harness: multiple compilers, platforms
Emulators? e.g. SoftFloat http://www.jhauser.us/arithmetic/SoftFloat.html
Provenance and metadata requirements relaxed for development workflow.

Statistical comparison across model versions

Live monitoring of model runs. From GFDL MDT Tracking Page...


Statistical comparison across model versions

Are two runs the same or different? What difference in inputs is responsible for the discrepancy? From GFDL MDT Tracking Page...
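When bitwise comparison is no longer possible, "same or different" becomes a statistical question. The sketch below is one simple way to pose it, not the method used on the tracking page: it compares a scalar diagnostic from two small ensembles with Welch's t statistic. The sample values and the threshold of 2.0 are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def consistent(a, b, threshold=2.0):
    """Treat two ensembles as statistically the same if |t| is below the threshold."""
    return abs(welch_t(a, b)) < threshold


# Hypothetical global-mean diagnostics from three five-member ensembles.
reference = [14.1, 14.3, 13.9, 14.0, 14.2]
rebuilt = [14.0, 14.2, 14.1, 13.9, 14.3]     # e.g. same inputs, new compiler
modified = [14.6, 14.8, 14.5, 14.7, 14.9]    # e.g. a changed input parameter

print(consistent(reference, rebuilt))
print(consistent(reference, modified))
```

A workflow could run such a check automatically after each model change and flag only the runs whose statistics have moved.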

Multi-model ensembles for climate projection

Figure SPM.7 from the IPCC AR5 Report. Can be interpreted as the most general and rigorous test of scientific reproducibility.

Multi-model ensembles for climate projection

Critically depends on software, metadata, and data standards: the Earth System Grid Federation (http://esgf.org): a 3 PB federated archive.
Workflows for replication, versioning, subsetting, QC, citation.


Summary

Community is struggling to move from synchronous to asynchronous data flow on the coming hardware platforms.
Hardware is blurring the line between cache and memory, memory and storage (deep hierarchy).
Energy to solution as a benchmark across the entire workflow.
Expressing entire workflow as graphs (DAGs).
    Maximize parallelism across the entire graph
    Minimize graph traversal during fault recovery
Accommodating different needs for development and production workflows: relax provenance and metadata requirements during development.
Irreproducible computing: include statistical consistency testing into workflow.

CMIP8 is going to be awesome!


Thank you
