Bare Metal Library - NVIDIA GTC (on-demand.gputechconf.com/gtc/.../s8154-bare-metal...)
TRANSCRIPT
Bare Metal Library
Abstractions for modern hardware
Cyprien Noel
1. Modern Hardware?
2. New challenges & opportunities
3. Three use cases
○ Current solutions
○ Leveraging hardware
○ Simple abstraction
Plan
● High performance trading systems
○ Lock-free algos, distributed systems
● H2O
○ Distributed CPU machine learning, async SGD
● Flickr
○ Scaling deep learning on GPU ⎼ Multi GPU Caffe
○ RDMA, multicast, distributed Hogwild ⎼ CaffeOnSpark
● UC Berkeley
○ NCCL Caffe, GPU cluster tooling
○ Bare Metal
Myself
Modern Hardware?
Device-to-device networks
Number crunching ➔ GPU
FS, block io, virt mem ➔ Pmem
Network stack ➔ RDMA
RAID, replication ➔ Erasure codes
Device mem ➔ Coherent fabrics
And more:
Video, crypto etc.
Moving from ms software to µs hardware
OS abstractions replaced by
More powerful, but also more complex and non-interoperable
● CUDA
● OFED
● Libpmem
● DPDK
● SPDK
● Libfabric
● UCX
● VMA
● More every week...
Summary So Far
● Big changes coming!
○ At least for high-performance applications
● CPU should orchestrate
○ Not in critical path
○ Device-to-device networks
● Retrofitting existing architectures is difficult
○ CPU-centric abstractions
○ ms software on µs hardware (e.g. 100s of instructions per packet)
○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration, most features inexpressible
What do we do?
● Start from scratch?
○ E.g. Google Fuchsia - no fs, block io, network etc.
○ Very interesting, but future work
● Use already accelerated frameworks?
○ E.g. PyTorch, BeeGFS
○ Not general purpose, no interop, not device-to-device
● Work incrementally from use cases
○ Look for the simplest hardware solution
○ Hopefully useful abstractions will emerge
Use cases
● Build datasets
○ Add, update elements
○ Apply functions to sets, map-reduce
○ Data versioning
● Training & inference
○ Compute graphs, pipelines
○ Deployment
○ Model versioning
Datasets
● Typical solution
○ Protobuf messages
○ KV store
○ Dist. file system
● Limitations
○ Serialization granularity
○ Copies: kv log, kernel1, replication, kernel2, fs
○ Remote CPU involved, stragglers
○ Cannot place data in device
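The serialization-granularity limitation can be illustrated with a toy sketch (JSON and a dict stand in for protobuf and the KV store here): because values are opaque serialized blobs, updating a single small field still round-trips the entire record.

```python
import json

# Toy KV store: values are opaque serialized blobs, as they would
# be with protobuf messages in a real key-value store.
store = {}

def put(key, record):
    # The WHOLE record is re-serialized, however small the change.
    store[key] = json.dumps(record).encode()

def get(key):
    return json.loads(store[key].decode())

put("/test", {"name": "sample", "pixels": [0] * 1000, "label": 3})

# Flipping one small field still copies the full blob both ways:
r = get("/test")          # deserialize everything
r["label"] = 4
put("/test", r)           # re-serialize everything

assert get("/test")["label"] == 4
```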
Datasets
● Simplest hardware implementation
○ Write protobuf in arena, like Flatbuffers
○ Pick an offset on disks, e.g. a namespace
○ Call ibv_exp_ec_encode_async
● Comments
○ Management, coordination, crash resiliency
○ Thin wrapper over HW: line-rate perf.
● User abstraction?
○ Simple, familiar
○ Efficient, device friendly
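The erasure-coding idea behind that last step can be sketched with the simplest possible code, a single XOR parity over two data shards (a 2+1 layout, which is where a ~1.5x storage factor comes from). The real implementation offloads a proper Reed-Solomon encode to the NIC via ibv_exp_ec_encode_async; this toy only shows the recovery property.

```python
# Minimal erasure-coding sketch: one XOR parity shard over two
# data shards. Losing any single shard is survivable, at 1.5x
# storage instead of 2x for full replication.

def encode(shard_a: bytes, shard_b: bytes) -> bytes:
    assert len(shard_a) == len(shard_b)
    return bytes(x ^ y for x, y in zip(shard_a, shard_b))

def recover(parity: bytes, surviving: bytes) -> bytes:
    # XOR is its own inverse: lost = parity ^ surviving
    return bytes(p ^ s for p, s in zip(parity, surviving))

a, b = b"hello!", b"world!"
parity = encode(a, b)

# Lose shard b; rebuild it from the parity and shard a.
assert recover(parity, a) == b
```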
EC shard
● Extension to classic mmap
○ Distributed
○ Typed - Protobuf, other formats planned
● Protobuf is amazing
○ Forward and backward compatible
○ Lattice
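The forward/backward compatibility property can be sketched without a protobuf dependency, using dicts as a stand-in for wire messages: an old reader skips fields it does not know, and a new reader fills defaults for fields that are missing, so readers and writers on different schema versions interoperate.

```python
# Sketch of protobuf-style schema evolution; dicts stand in for
# wire messages, and a field-default table stands in for the schema.

V1_FIELDS = {"name": ""}                # old schema
V2_FIELDS = {"name": "", "tags": []}    # new schema adds a field

def read(message: dict, fields: dict) -> dict:
    # Unknown fields are skipped, missing fields get defaults,
    # so both old-reads-new and new-reads-old work.
    return {k: message.get(k, default) for k, default in fields.items()}

new_msg = {"name": "x", "tags": ["a"]}
old_msg = {"name": "y"}

assert read(new_msg, V1_FIELDS) == {"name": "x"}
assert read(old_msg, V2_FIELDS) == {"name": "y", "tags": []}
```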
mmap
mmap
● C++
const Test& test = mmap<Test>("/test");
int i = test.field();
● Python
test = Test()
bm.mmap("/test", test)
i = test.field()
● Simple abstraction for data storage
● Fully accelerated, “mechanically friendly”
○ Thin wrapper over HW, device-to-device, zero copy
○ ~1.5x replication factor
○ Network automatically balanced
○ Solves straggler problem
○ No memory pinning or TLB thrashing, NUMA aware
mmap, recap
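Since the abstraction extends the classic mmap, the zero-copy idea can be sketched with the Python standard library alone: a fixed-layout record is viewed in place through the OS page cache, with no read() or deserialization copies. Here struct stands in for the library's typed protobuf arenas.

```python
import mmap
import struct
import tempfile

# "Typed mmap" sketch: a fixed-layout record accessed directly
# through a memory mapping, no serialization step in between.
RECORD = struct.Struct("<iq")  # an int32 field and an int64 field

with tempfile.TemporaryFile() as f:
    f.write(RECORD.pack(12, 34))
    f.flush()
    m = mmap.mmap(f.fileno(), RECORD.size)
    # Reads straight out of the mapping -- no intermediate buffer.
    field0, field1 = RECORD.unpack_from(m, 0)
    assert (field0, field1) == (12, 34)
    m.close()
```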
Use cases
● Compute
○ Map-reduce, compute graphs, pipelines
● Typical setup
○ Spark, DL frameworks
○ Distribution using Akka, gRPC, MPI
○ Kubernetes or SLURM scheduling
● Limitations
○ No interop
○ Placement difficult
○ Inefficient resource allocation
Compute
● Simplest hardware implementation
○ Define a task, e.g. img. resize, CUDA kernel, PyTorch graph
○ Place tasks in queue
○ Work stealing - RDMA atomics
○ Device-to-device chaining - GPU Direct Async
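The work-stealing step can be sketched locally: each worker owns a queue and an idle worker steals from the other end of a busy worker's queue. A Python lock stands in here for the RDMA atomic (compare-and-swap / fetch-and-add on a remote queue head) that the hardware version would use.

```python
import threading
from collections import deque

class WorkQueue:
    """One worker's task queue; the lock plays the role of the RDMA atomic."""
    def __init__(self, tasks):
        self.tasks = deque(tasks)
        self.lock = threading.Lock()

    def pop_own(self):
        with self.lock:
            return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thieves take from the opposite end of the owner.
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

results = []
q0 = WorkQueue(range(8))   # one busy worker...
q1 = WorkQueue([])         # ...and one idle worker

def worker(own, victim):
    while True:
        task = own.pop_own()
        if task is None:
            task = victim.steal()   # nothing local: steal
        if task is None:
            break                   # cluster-wide idle: stop
        results.append(task * task)

t0 = threading.Thread(target=worker, args=(q0, q1))
t1 = threading.Thread(target=worker, args=(q1, q0))
t0.start(); t1.start(); t0.join(); t1.join()

# Every task ran exactly once, split across both workers.
assert sorted(results) == [i * i for i in range(8)]
```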
● User abstraction?
task
● Python
@bm.task
def compute(x, y):
    return x * y

# Runs locally
compute(1, 2)

# Might be rebalanced on cluster
data = bm.list()
bm.mmap("/data", data)
compute(data, 2)
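A local-only sketch of what such a decorator could look like (hypothetical; the real bm.task places the task in a distributed queue where any worker may steal and execute it):

```python
import functools

# Every invocation is recorded as a (name, args) task record --
# what a scheduler would see -- then simply run in place.
task_queue = []

def task(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        task_queue.append((fn.__name__, args))
        return fn(*args, **kwargs)   # local stand-in for remote execution
    return wrapper

@task
def compute(x, y):
    return x * y

assert compute(1, 2) == 2
assert task_queue == [("compute", (1, 2))]
```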
task, recap
● Simple abstraction for CPU and device kernels
● Work stealing instead of explicit schedule
○ No GPU hoarding
○ Better work balancing
○ Dynamic placement, HA
● Device-to-device chaining
○ Data placed directly in device memory
○ Efficient pipelines, even very short tasks
○ E.g. model parallelism, low latency inference
Use cases
● Versioning
○ Track datasets and models
○ Deploy / rollback models
● Typical setup
○ Copy before update
○ Symlinks to data as versions
○ Staging / production environments split
Versioning
● Simplest hardware implementation
○ Keep multiple write-ahead logs
○ mmap updates
○ task queues
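The write-ahead-log idea can be sketched in a few lines: every update is appended to a log, and any past version is just a replay of a log prefix. Keeping one such log per branch generalizes this to the branch abstraction below.

```python
# Versioning via a write-ahead log: the version number is simply
# the log length, and any version can be rebuilt by replaying.
log = []   # ordered (key, value) updates

def write(key, value):
    log.append((key, value))
    return len(log)          # version after this update

def read(version):
    # Materialize the state at `version` by replaying the prefix.
    state = {}
    for key, value in log[:version]:
        state[key] = value
    return state

v1 = write("field", 12)
v2 = write("field", 34)

assert read(v1) == {"field": 12}   # old version still reconstructible
assert read(v2) == {"field": 34}
```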
● User abstraction?
branch
● Like a git branch
○ But for any size data
○ Simplifies collaboration, experimentation
○ Generalized staging / production split
● Simplifies HA
○ File system fsync, msync (very hard! Rajimwale et al., DSN ’11)
○ Replaces transactions, e.g. queues, persistent memory
○ Allows duplicate work merge
branch
● C++
Test* test = mutable_mmap<Test>("/test");
branch b;
// Only visible in current branch
test->set_field(12);
● Similar in Python
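The branch semantics shown above can be sketched as a copy-on-write overlay (not the library's actual API): writes land in the branch's private overlay and stay invisible to the parent state until a merge.

```python
from collections import ChainMap

# A branch is an empty overlay chained in front of the parent
# state: reads fall through, writes stay in the overlay.
main = {"field": 1}
branch = ChainMap({}, main)

branch["field"] = 12            # write goes to the overlay only
assert branch["field"] == 12    # visible in the branch...
assert main["field"] == 1       # ...invisible on main

main.update(branch.maps[0])     # "merge": apply the overlay to main
assert main["field"] == 12
```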
● mmap, task, and branch simplify hardware acceleration
● Helps build pipelines, manage cluster resources etc.
● Early micro benchmarks suggest very high performance
Summary
Thank You!
Will be open sourced under BSD
Contact me if interested - [email protected]
Thanks to our sponsor