Bare Metal Library - NVIDIA GTC (on-demand.gputechconf.com/gtc/.../s8154-bare-metal...)
TRANSCRIPT
Bare Metal Library
Abstractions for modern hardware
Cyprien Noel
1. Modern Hardware?
2. New challenges & opportunities
3. Three use cases
○ Current solutions
○ Leveraging hardware
○ Simple abstraction
Plan
● High performance trading systems
○ Lock-free algos, distributed systems
● H2O
○ Distributed CPU machine learning, async SGD
● Flickr
○ Scaling deep learning on GPU ⎼ Multi GPU Caffe
○ RDMA, multicast, distributed Hogwild ⎼ CaffeOnSpark
● UC Berkeley
○ NCCL Caffe, GPU cluster tooling
○ Bare Metal
Myself
Modern Hardware?
Device-to-device networks
Number crunching ➔ GPU
FS, block io, virt mem ➔ Pmem
Network stack ➔ RDMA
RAID, replication ➔ Erasure codes
Device mem ➔ Coherent fabrics
And more:
Video, crypto etc.
Moving from ms software to µs hardware
OS abstractions replaced by
More powerful, but also more complex and non-interoperable
● CUDA
● OFED
● Libpmem
● DPDK
● SPDK
● Libfabric
● UCX
● VMA
● More every week...
Summary So Far
● Big changes coming!
○ At least for high-performance applications
● CPU should orchestrate
○ Not in critical path
○ Device-to-device networks
● Retrofitting existing architectures is difficult
○ CPU-centric abstractions
○ ms software on µs hardware (e.g. 100s of instructions per packet)
○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration, most features inexpressible
What do we do?
● Start from scratch?
○ E.g. Google Fuchsia - no fs, block io, network etc.
○ Very interesting, but future work
● Use already accelerated frameworks?
○ E.g. PyTorch, BeeGFS
○ Not general purpose, no interop, not device-to-device
● Work incrementally from use cases
○ Look for the simplest hardware solution
○ Hopefully useful abstractions will emerge
Use cases
● Build datasets
○ Add, update elements
○ Apply functions to sets, map-reduce
○ Data versioning
● Training & inference
○ Compute graphs, pipelines
○ Deployment
○ Model versioning
Datasets
● Typical solution
○ Protobuf messages
○ KV store
○ Dist. file system
● Limitations
○ Serialization granularity
○ Copies: kv log, kernel1, replication, kernel2, fs
○ Remote CPU involved, stragglers
○ Cannot place data in device
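The serialization-granularity limitation can be illustrated with a toy sketch (JSON and a dict stand in for protobuf and the KV store here): because values are opaque serialized blobs, updating a single small field still round-trips the entire record.

```python
import json

# Toy KV store: values are opaque serialized blobs, as they would
# be with protobuf messages in a real key-value store.
store = {}

def put(key, record):
    # The WHOLE record is re-serialized, however small the change.
    store[key] = json.dumps(record).encode()

def get(key):
    return json.loads(store[key].decode())

put("/test", {"name": "sample", "pixels": [0] * 1000, "label": 3})

# Flipping one small field still copies the full blob both ways:
r = get("/test")          # deserialize everything
r["label"] = 4
put("/test", r)           # re-serialize everything

assert get("/test")["label"] == 4
```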
Datasets
● Simplest hardware implementation
○ Write protobuf in arena, like Flatbuffers
○ Pick an offset on disks, e.g. a namespace
○ Call ibv_exp_ec_encode_async
● Comments
○ Management, coordination, crash resiliency
○ Thin wrapper over HW: line-rate perf.
● User abstraction?
○ Simple, familiar
○ Efficient, device friendly
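The erasure-coding idea behind that last step can be sketched with the simplest possible code, a single XOR parity over two data shards (a 2+1 layout, which is where a ~1.5x storage factor comes from). The real implementation offloads a proper Reed-Solomon encode to the NIC via ibv_exp_ec_encode_async; this toy only shows the recovery property.

```python
# Minimal erasure-coding sketch: one XOR parity shard over two
# data shards. Losing any single shard is survivable, at 1.5x
# storage instead of 2x for full replication.

def encode(shard_a: bytes, shard_b: bytes) -> bytes:
    assert len(shard_a) == len(shard_b)
    return bytes(x ^ y for x, y in zip(shard_a, shard_b))

def recover(parity: bytes, surviving: bytes) -> bytes:
    # XOR is its own inverse: lost = parity ^ surviving
    return bytes(p ^ s for p, s in zip(parity, surviving))

a, b = b"hello!", b"world!"
parity = encode(a, b)

# Lose shard b; rebuild it from the parity and shard a.
assert recover(parity, a) == b
```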
EC shard
● Extension to classic mmap
○ Distributed
○ Typed - Protobuf, other formats planned
● Protobuf is amazing
○ Forward and backward compatible
○ Lattice
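The forward/backward compatibility property can be sketched without a protobuf dependency, using dicts as a stand-in for wire messages: an old reader skips fields it does not know, and a new reader fills defaults for fields that are missing, so readers and writers on different schema versions interoperate.

```python
# Sketch of protobuf-style schema evolution; dicts stand in for
# wire messages, and a field-default table stands in for the schema.

V1_FIELDS = {"name": ""}                # old schema
V2_FIELDS = {"name": "", "tags": []}    # new schema adds a field

def read(message: dict, fields: dict) -> dict:
    # Unknown fields are skipped, missing fields get defaults,
    # so both old-reads-new and new-reads-old work.
    return {k: message.get(k, default) for k, default in fields.items()}

new_msg = {"name": "x", "tags": ["a"]}
old_msg = {"name": "y"}

assert read(new_msg, V1_FIELDS) == {"name": "x"}
assert read(old_msg, V2_FIELDS) == {"name": "y", "tags": []}
```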
mmap
mmap
● C++
const Test& test = mmap<Test>("/test");
int i = test.field();
● Python
test = Test()
bm.mmap("/test", test)
i = test.field()
● Simple abstraction for data storage
● Fully accelerated, “mechanically friendly”
○ Thin wrapper over HW, device-to-device, zero copy
○ ~1.5x replication factor
○ Network automatically balanced
○ Solves straggler problem
○ No memory pinning or TLB thrashing, NUMA aware
mmap, recap
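Since the abstraction extends the classic mmap, the zero-copy idea can be sketched with the Python standard library alone: a fixed-layout record is viewed in place through the OS page cache, with no read() or deserialization copies. Here struct stands in for the library's typed protobuf arenas.

```python
import mmap
import struct
import tempfile

# "Typed mmap" sketch: a fixed-layout record accessed directly
# through a memory mapping, no serialization step in between.
RECORD = struct.Struct("<iq")  # an int32 field and an int64 field

with tempfile.TemporaryFile() as f:
    f.write(RECORD.pack(12, 34))
    f.flush()
    m = mmap.mmap(f.fileno(), RECORD.size)
    # Reads straight out of the mapping -- no intermediate buffer.
    field0, field1 = RECORD.unpack_from(m, 0)
    assert (field0, field1) == (12, 34)
    m.close()
```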
Use cases
● Compute
○ Map-reduce, compute graphs, pipelines
● Typical setup
○ Spark, DL frameworks
○ Distribution using Akka, gRPC, MPI
○ Kubernetes or SLURM scheduling
● Limitations
○ No interop
○ Placement difficult
○ Inefficient resource allocation
Compute
● Simplest hardware implementation
○ Define a task, e.g. img. resize, CUDA kernel, PyTorch graph
○ Place tasks in queue
○ Work stealing - RDMA atomics
○ Device-to-device chaining - GPU Direct Async
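The work-stealing step can be sketched locally: each worker owns a queue and an idle worker steals from the other end of a busy worker's queue. A Python lock stands in here for the RDMA atomic (compare-and-swap / fetch-and-add on a remote queue head) that the hardware version would use.

```python
import threading
from collections import deque

class WorkQueue:
    """One worker's task queue; the lock plays the role of the RDMA atomic."""
    def __init__(self, tasks):
        self.tasks = deque(tasks)
        self.lock = threading.Lock()

    def pop_own(self):
        with self.lock:
            return self.tasks.pop() if self.tasks else None

    def steal(self):
        # Thieves take from the opposite end of the owner.
        with self.lock:
            return self.tasks.popleft() if self.tasks else None

results = []
q0 = WorkQueue(range(8))   # one busy worker...
q1 = WorkQueue([])         # ...and one idle worker

def worker(own, victim):
    while True:
        task = own.pop_own()
        if task is None:
            task = victim.steal()   # nothing local: steal
        if task is None:
            break                   # cluster-wide idle: stop
        results.append(task * task)

t0 = threading.Thread(target=worker, args=(q0, q1))
t1 = threading.Thread(target=worker, args=(q1, q0))
t0.start(); t1.start(); t0.join(); t1.join()

# Every task ran exactly once, split across both workers.
assert sorted(results) == [i * i for i in range(8)]
```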
● User abstraction?
task
● Python
@bm.task
def compute(x, y):
    return x * y

# Runs locally
compute(1, 2)

# Might be rebalanced on cluster
data = bm.list()
bm.mmap("/data", data)
compute(data, 2)
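A local-only sketch of what such a decorator could look like (hypothetical; the real bm.task places the task in a distributed queue where any worker may steal and execute it):

```python
import functools

# Every invocation is recorded as a (name, args) task record --
# what a scheduler would see -- then simply run in place.
task_queue = []

def task(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        task_queue.append((fn.__name__, args))
        return fn(*args, **kwargs)   # local stand-in for remote execution
    return wrapper

@task
def compute(x, y):
    return x * y

assert compute(1, 2) == 2
assert task_queue == [("compute", (1, 2))]
```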
task, recap
● Simple abstraction for CPU and device kernels
● Work stealing instead of explicit schedule
○ No GPU hoarding
○ Better work balancing
○ Dynamic placement, HA
● Device-to-device chaining
○ Data placed directly in device memory
○ Efficient pipelines, even very short tasks
○ E.g. model parallelism, low latency inference
Use cases
● Versioning
○ Track datasets and models
○ Deploy / rollback models
● Typical setup
○ Copy before update
○ Symlinks to data as versions
○ Staging / production environments split
Versioning
● Simplest hardware implementation
○ Keep multiple write-ahead logs
○ mmap updates
○ task queues
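The write-ahead-log idea can be sketched in a few lines: every update is appended to a log, and any past version is just a replay of a log prefix. Keeping one such log per branch generalizes this to the branch abstraction below.

```python
# Versioning via a write-ahead log: the version number is simply
# the log length, and any version can be rebuilt by replaying.
log = []   # ordered (key, value) updates

def write(key, value):
    log.append((key, value))
    return len(log)          # version after this update

def read(version):
    # Materialize the state at `version` by replaying the prefix.
    state = {}
    for key, value in log[:version]:
        state[key] = value
    return state

v1 = write("field", 12)
v2 = write("field", 34)

assert read(v1) == {"field": 12}   # old version still reconstructible
assert read(v2) == {"field": 34}
```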
● User abstraction?
branch
● Like a git branch
○ But for any size data
○ Simplifies collaboration, experimentation
○ Generalized staging / production split
● Simplifies HA
○ File system fsync, msync (very hard! Rajimwale et al., DSN ’11)
○ Replaces transactions, e.g. queues, persistent memory
○ Allows duplicate work merge
branch
● C++
Test* test = mutable_mmap<Test>("/test");
branch b;
// Only visible in current branch
test->set_field(12);
● Similar in Python
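The branch semantics shown above can be sketched as a copy-on-write overlay (not the library's actual API): writes land in the branch's private overlay and stay invisible to the parent state until a merge.

```python
from collections import ChainMap

# A branch is an empty overlay chained in front of the parent
# state: reads fall through, writes stay in the overlay.
main = {"field": 1}
branch = ChainMap({}, main)

branch["field"] = 12            # write goes to the overlay only
assert branch["field"] == 12    # visible in the branch...
assert main["field"] == 1       # ...invisible on main

main.update(branch.maps[0])     # "merge": apply the overlay to main
assert main["field"] == 12
```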
● mmap, task, and branch simplify hardware acceleration
● Helps build pipelines, manage cluster resources etc.
● Early micro benchmarks suggest very high performance
Summary
Thank You!
Will be open sourced under BSD
Contact me if interested - [email protected]
Thanks to our sponsor