Software Design Practices for Large-Scale Automation
TRANSCRIPT
Design for Large-Scale Automation 12/30/2015
Ongoing...
Design for large-scale, high-performance, distributed software systems for
complex algorithms such as graph, optimization, prediction, and machine
learning.
Corrections/improvements are very welcome at [email protected] (Hao Xu)
Topics
● Large-scale Automation: Why Challenging?
● Design Principles: Coping with Complexity and Physicality
● Computation Paradigms: HPC, Spark, TensorFlow
● Designs: Logical, Physical, System levels
● Distributed and Iterative Algorithms: Partition, Sync, Iteration Trade-offs
● Smart QA: Protection, Auditing, Debug codes
Design Objectives for Large-scale Automation
● Scalability (growing)
● Extensibility (evolving)
● Performance (fast)
● Maintenance (controllable)
Scalability: Name of the Game
● Electronics simulation: mandatory for simulation software to scale with
Moore’s law
● Internet Applications: systems need to be ready for next 10x user growth and
feature evolution
● Knowledge Base: bigger system improves cross referencing and hence quality
of learning new knowledge
● Deep learning: capacity of system affects quality of latent features learned
and hence the prediction capability
● Internet of Things: as the name suggests...
What makes it difficult? #1 Complexity
● Complexity is the TOP challenge for software engineering
● Usually grows with the scale of the system
○ exhibits different patterns at different scale
○ explodes with the number of software features
● The only way to handle complexity
○ “Divide and Conquer”
○ realized by various Design Principles
What makes it difficult? #2 Physicality
● Software is physical, just like humans
○ Results are stored in physical memory (RAM/ROM/Disk)
○ Computation is done in physical processing units (CPU/GPU/FPGA)
● Not feasible to build one gigantic machine that solves everything
○ System should live on machine farms
○ Data / Computation should be distributed
● Physicality complicates the design of systems
○ Data partition
○ Computation partition
Design Principles
Abstraction and Decoupling
Design Principles: The Philosophy
Design Principles for Coping with Complexity
● Abstraction (Vertical Divide & Conquer)
○ Core Abstractions
○ Hierarchization
● Decoupling (Horizontal Divide & Conquer)
○ Encapsulation
○ Layerization
(Diagram: Abstraction and Decoupling, with Decoupling as the centerpiece of large-scale system design.)
Abstraction: Vertical Divide and Conquer
● Core Abstractions
○ the soul of large-scale systems
○ the root of abstraction hierarchy
○ higher level abstraction = better extensibility
● Hierarchization
○ simplification of system functionality graph
○ ideally mapped into tree structures (no loop)
○ the template for Object Oriented Design
○ need a balance b/w delegation & check
Decoupling: Horizontal Divide and Conquer
● Encapsulation
○ components encapsulate complex logic
○ API design for minimal interface
● Layerization
○ algorithms divided into layers
○ each layer handles a feature/algorithm
■ layer 1: Graph partition and communication
■ layer 2: Graph node property analysis
■ layer 3: User operation on Graph nodes
■ ...
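As a minimal sketch of this layering (all names here are hypothetical, not from any real library), each layer owns one concern and only talks to the layer below it:

    # Hypothetical layered design: one concern per layer, narrow interfaces.
    class PartitionLayer:                  # layer 1: graph partition & communication
        def partition(self, nodes, n_parts):
            # trivial round-robin split; real systems use graph partitioners
            return [nodes[i::n_parts] for i in range(n_parts)]

    class AnalysisLayer:                   # layer 2: graph node property analysis
        def degree(self, edges, node):
            return sum(1 for u, v in edges if node in (u, v))

    class UserOpLayer:                     # layer 3: user operations on graph nodes
        def __init__(self, analysis):
            self.analysis = analysis       # depends only on the layer below
        def top_hubs(self, edges, nodes, k):
            return sorted(nodes, key=lambda n: -self.analysis.degree(edges, n))[:k]

Each layer can be developed, tested, and replaced independently, which is the point of the horizontal divide.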
The Priority of Abstractions for Project Management
● Core abstractions (1st Priority)
○ Determines functionality/scalability
● Library abstractions (2nd Priority)
○ Determines performance
● Logic abstractions (low priority)
○ Flows
○ Apps
○ Business logics
Computation Paradigms
Language level, Flow level, System level
Computation Paradigms: The Framework
Computation Paradigms
● What is Computation Paradigm?
○ Computation abstraction at different levels
○ Offers encapsulation and parallelism at different levels
○ Crucial to choose the right computation paradigm
● Computation Paradigm at different levels
○ Language level: Python, C, Scala
○ Flow level: Imperative, Symbolic, Functional programming
○ System level: Computation-centric (HPC) or Data-centric (e.g. Spark)
Flow level: Imperative Programming
● Imperative Programming: No native abstraction
○ C++ / Python / Java
○ Computation at instruction level
○ Task level parallel
Flow level: Functional Programming
● Functional Programming: Data abstraction
○ Scala / MapReduce
○ Immutable, Stateless function
● Pros
○ Offers Data level parallel
● Cons
○ Data is read-only; an update requires making a new copy
○ Higher memory consumption and potential performance overhead
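A small Python illustration of the trade-off above: the functional style never mutates its input, which makes data-level parallelism safe but forces a fresh copy on every update.

    # Imperative update: mutates in place, no extra allocation.
    def scale_in_place(xs, k):
        for i in range(len(xs)):
            xs[i] *= k
        return xs

    # Functional update: the input stays immutable; a whole new list is
    # allocated, which costs memory but lets workers share inputs freely.
    def scale_pure(xs, k):
        return [x * k for x in xs]

    data = [1, 2, 3]
    scaled = scale_pure(data, 10)   # data is unchanged: [1, 2, 3]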
Flow level: Symbolic Programming
● Symbolic Programming: Operator abstraction
○ Theano / TensorFlow
○ Operator level parallel
○ Graph model as base engine
● Pros
○ Offers high operator parallelism through graph propagation
● Cons
○ Not flexible for all programming tasks
○ May incur overhead when handling fine-grained operators
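A toy sketch of the operator abstraction (this is not TensorFlow's or Theano's actual API): expressions build a graph first, and an engine later propagates values through it, which is where operator-level parallelism comes from.

    # Toy symbolic engine: ops are graph nodes; nothing runs until evaluate().
    class Op:
        def __init__(self, fn, *inputs):
            self.fn, self.inputs = fn, inputs
        def __add__(self, other):
            return Op(lambda a, b: a + b, self, other)
        def __mul__(self, other):
            return Op(lambda a, b: a * b, self, other)

    def const(v):
        return Op(lambda: v)

    def evaluate(node, cache=None):
        # post-order graph propagation; independent subtrees could run in parallel
        cache = {} if cache is None else cache
        if node not in cache:
            cache[node] = node.fn(*(evaluate(i, cache) for i in node.inputs))
        return cache[node]

    x, y = const(2), const(3)
    z = x * y + x        # builds the graph; no computation happens here
    print(evaluate(z))   # 8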
System level: Computation-Centric System (typical HPC 1)
● What is HPC
○ HPC is extreme parallel computing
○ Computation Partition
■ Communication delay aware
● Intra-node: L3/L2/L1 caches
● Inter-node interconnect: ~100 Gb/s
● Inter-cluster Ethernet: ~1 Gb/s, plus RAM-to-disk time
■ Physical architecture aware
● Register size etc.
System level: Computation-Centric System (typical HPC 2)
● Parallel at different levels
○ Multi-threading
○ Multi-process
○ Distributed cluster
○ Mainstream communication: MPI
● Partition based on needs of communication
○ Minimize communication
○ Algorithm partition
○ Data partition
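A minimal sketch of this pattern with mpi4py (assuming it is installed; run under mpiexec): each rank keeps its partition local and pays for exactly one collective sync.

    # Run with e.g.: mpiexec -n 4 python partial_sum.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local_data = range(rank * 100, (rank + 1) * 100)   # this rank's partition
    local_sum = sum(local_data)                        # pure local computation

    global_sum = comm.allreduce(local_sum, op=MPI.SUM) # the only communication
    if rank == 0:
        print(global_sum)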
System level: Computation-Centric System (typical HPC 3)
● Exploit Heterogeneous Components
○ GPU acceleration (many small cores)
■ If the model is too small, transfer overhead dominates: keep it on the CPU
■ If the model is too large to fit in GPU memory, accelerate it partially
■ Exchange data with the CPU through explicit memory copies
○ FPGA (millions of gates)
○ SSD, RAID 0/1/5/10
● Disk IO
○ HDF5 parallel read/write
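A hedged sketch of parallel HDF5 I/O through h5py's 'mpio' driver (this requires an h5py build linked against parallel HDF5, which is an assumption here): every rank writes its own slice of a single shared file.

    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # collective open of one shared file; each rank owns one row
    with h5py.File('results.h5', 'w', driver='mpio', comm=comm) as f:
        dset = f.create_dataset('x', (size, 100), dtype='f8')
        dset[rank, :] = rank        # non-overlapping writes, no locking needed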
System level: Data-Centric System (Spark-like)
● Data partition: Physically distributed central DB
○ Serialization: boost::serialization (C++), pickling (Python)
● Scalable computation
○ Usually has a scheduler
○ Explicit scheduling: user defines computation graph nodes
○ Implicit scheduling: engine analyzes the computation graph
● Stateless
○ Good for debugging; easy to recover from failure
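A minimal sketch of the data-centric pattern (names hypothetical): tasks are stateless functions, partitions travel as pickled blobs, and statelessness is what makes retry-on-failure trivial.

    import pickle

    def word_count(chunk):              # stateless: output depends only on input
        return len(chunk.split())

    chunks = ["a b c", "d e", "f"]
    blobs = [pickle.dumps(c) for c in chunks]   # partitions as shippable bytes

    # Trivial explicit scheduler: in a real engine these run on remote workers,
    # and a failed task is simply re-run from its immutable input blob.
    results = [word_count(pickle.loads(b)) for b in blobs]
    print(sum(results))                          # 6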
System level: Hybrid Architecture
● Hybrid Architecture Example: TensorFlow
○ Stochastic algorithms → use Data-centric model
■ E.g. Back propagation: Parameter Server
○ Deterministic algorithms → use Computation-centric (HPC) model
■ E.g. Common data sync among model partitions: Bulk Synchronous
Parallel
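A toy single-process contrast of the two sync models named above (a sketch, not TensorFlow's implementation):

    # Bulk Synchronous Parallel: every partition finishes the superstep,
    # then one global sync (barrier + reduce) before anyone proceeds.
    def bsp_superstep(partitions, compute):
        results = [compute(p) for p in partitions]   # all finish first
        return sum(results) / len(results)           # the global sync point

    # Parameter-server style: workers push updates whenever they are ready;
    # no barrier, so reads may see slightly stale parameters, which
    # stochastic methods like back propagation tolerate well.
    class ParamServer:
        def __init__(self, w=0.0):
            self.w = w
        def push(self, grad, lr=0.1):
            self.w -= lr * grad      # asynchronous update
        def pull(self):
            return self.w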
Designs: The Quality
Logical Design
Objectify, Modularize, Standardize
Logical Design
● Objectify everything
○ an object can have multiple copies for parallel computing
○ avoid singleton / global / static variables
○ the top level should only fall through (dispatch); it should not execute anything itself
Logical Design
● Standardize everything
○ Base Class for any task = function(data, parameters, executor_id); see the sketch after this list
○ schema (base class) for task
○ schema for any data
○ schema for any function
○ schema for any parameter
● Benefits
○ higher level automation
○ potentially more intelligent system
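A minimal sketch of the task schema above, assuming the function(data, parameters, executor_id) signature from this slide (class names are hypothetical):

    # Standardized task schema: every task has the same callable shape, so
    # schedulers, loggers, and QA hooks can treat all tasks uniformly.
    class Task:
        def __init__(self, fn, data, parameters):
            self.fn, self.data, self.parameters = fn, data, parameters

        def run(self, executor_id):
            # one uniform entry point is what enables higher-level automation
            return self.fn(self.data, self.parameters, executor_id)

    def normalize(data, parameters, executor_id):
        scale = parameters.get("scale", 1.0)
        return [x / scale for x in data]

    t = Task(normalize, data=[2.0, 4.0], parameters={"scale": 2.0})
    print(t.run(executor_id=0))   # [1.0, 2.0]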
Logical Design
● Modularize everything
○ encapsulate data by using setter / getter
○ encapsulate atomic or repeated functionality
○ #define any hard-coded number
○ factorize long function or class
○ build shared libraries from bottom-up
■ communication lib
■ parallel computing lib
■ debug / reporting lib
Physical Design
Code, Memory, Performance
Physical Design: Code
● Source Code
○ component level decouple by folder
○ module level decouple by file
○ variable space decouple by namespace
● Code change
○ physical change (files/folders touched) should reflect logical change
○ change scope should narrow down as development progresses
○ diff management
Physical Design: Memory 1)
● Memory is the #1 factor for performance
○ Code runs in memory, not in the air
● OS Memory Handling
○ Memory allocation, fragmentation, release etc
○ tcmalloc vs. jemalloc
■ Both improve allocation and fragmentation behavior
■ Both still have issues releasing memory back to the OS
Physical Design: Memory 2)
● Interpreter Memory Handling
○ Garbage Collection
● Manual Memory Management
○ memory pooling is mandatory
○ memory lifecycle management for any large usage
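A minimal memory-pool sketch in Python to make the idea concrete (a production HPC pool would live in C/C++ and manage raw allocations):

    # Buffer pool: recycle fixed-size buffers through a free list instead of
    # reallocating, which avoids allocator churn and fragmentation over time.
    class BufferPool:
        def __init__(self, buf_size, count):
            self.buf_size = buf_size
            self._free = [bytearray(buf_size) for _ in range(count)]

        def acquire(self):
            # grow only when the pool is exhausted
            return self._free.pop() if self._free else bytearray(self.buf_size)

        def release(self, buf):
            self._free.append(buf)   # lifecycle: the pool owns every buffer

    pool = BufferPool(buf_size=4096, count=8)
    buf = pool.acquire()
    # ... fill and consume buf ...
    pool.release(buf)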
Physical Design: Memory 3)
● Trade-offs
○ Depends on application
■ Memory critical: tcmalloc/jemalloc
■ Memory and performance critical: manual memory management (MMU)
○ HPC is memory and performance critical
■ Parallelism does not solve every problem; single-machine performance is still the dominant factor
■ You need to know the code very well to design a manual MMU
○ Spark is replacing JVM memory management with the Tungsten project
Physical Design: Performance
● Performance Tuning
○ profiling, profiling, profiling...
○ lazy initialization / write / read
○ cache-aware design
■ cache-friendly data structure
● linked structure locality
■ cache-friendly algorithm
● read / write locality
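A small illustration of read locality, assuming numpy is available: a row-major array is fast to traverse along contiguous rows and slow along strided columns, even though both loops do the same arithmetic.

    import numpy as np

    a = np.zeros((4096, 4096))        # C order: rows are contiguous in memory

    def row_major_sum(a):             # cache-friendly: sequential reads
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i, :].sum()
        return total

    def col_major_sum(a):             # cache-hostile: one useful element
        total = 0.0                   # per cache line fetched
        for j in range(a.shape[1]):
            total += a[:, j].sum()
        return total

Timing the two functions shows the gap; cache-friendly data structures arrange data so the hot loops look like the first version.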
System Design
Distributed, Parallel, Resilient
System Design
● Scalable Distributed System
○ DB Service: Data and Computation decouple
○ Task/Scheduler: Computation and Execution decouple
○ Query/Queue: Producer and Consumer decouple
System Design
● DB Service
○ Logically Centralized
■ Parameter Server
○ Physically distributed
■ Only routing / bookkeeping service on Master
■ Master capacity is not an issue
■ Computation locality on Slaves
System Design
● Parallel Computing
○ multi-threading
■ light overhead
■ shared memory, data exchange OK
○ multi-process
■ heavy overhead
■ separated memory space, more difficult data exchange
○ distributed multiple machine
■ balance between computation and communication
System Design
● TensorFlow Example
○ Multi-threading: Graph Execution Engine
■ BFS
■ DFS
○ Multi-machine: Graph partition
■ Edge-cut?
■ Vertex-cut?
System Design
● Fault Tolerance
○ Monitor granularity
■ system level: module behavior
■ flow level: major steps
■ algorithm level: major checkpoints
○ Persistence granularity
■ recovery depth
■ recovery contents
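A minimal checkpoint/recover sketch (the file name and granularity are illustrative choices, not prescribed by the slide):

    import os
    import pickle

    CKPT = "state.ckpt"

    def checkpoint(state, step):
        # persistence granularity: what we save here bounds the recovery depth
        with open(CKPT + ".tmp", "wb") as f:
            pickle.dump({"step": step, "state": state}, f)
        os.replace(CKPT + ".tmp", CKPT)   # atomic rename: never a torn file

    def recover():
        if os.path.exists(CKPT):
            with open(CKPT, "rb") as f:
                snap = pickle.load(f)
            return snap["state"], snap["step"]
        return None, 0                    # cold start

    state, start_step = recover()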
Distributed and Iterative Algorithms
Partition, Sync, Iterate, Global/Local Optimum
Distributed and Iterative Algorithms: The Lifeblood
Key Issues of Distributed Algorithms
● Data / Model partition
○ inference data partition; graph partition; datastore sharding
● Communication paradigm
○ Spark RDD; MPI; RPC
● Computation locality
○ locality-aware job scheduling; Yarn; Drill
● Parallel algorithm paradigm
○ Map/Reduce; Spark
● Multi-stage distributed flow
Distributed Deterministic Algorithms 1)
● What to sync?
○ what is the key information needed to stitch the pieces together
○ sync data to resemble the single-machine algorithm (rare, but can be useful)
○ keep data local, sync only results (map/reduce; see the sketch after this list)
● When to sync?
○ lazy sync (e.g. Bulk Synchronous Parallel)
○ async (e.g. Parameter Server)
● Where to sync?
○ refactor algorithm by optimal sync point
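A toy example of "keep data local, sync results": each worker reduces its own partition, and only tiny summaries cross the wire.

    # Global mean over 3 workers' partitions, map/reduce style.
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]       # data stays local

    # map: purely local computation, no communication
    local_stats = [(sum(p), len(p)) for p in partitions]

    # sync point: only (sum, count) pairs are exchanged, never the raw data
    total = sum(s for s, _ in local_stats)
    count = sum(n for _, n in local_stats)
    print(total / count)                                  # 5.0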
Distributed Deterministic Algorithms 2)
● Trade-offs
○ performance
■ computation VS. communication
○ scalability
■ need scalable communication pattern
■ avoid point-to-point communication
Distributed Approximate Algorithms 1)
● QoR loss in distributed computing
○ for many algorithms, lack of global sync leads to QoR (Quality of Results) loss
○ full global sync is very expensive in communication cost
○ carefully choose sync points to maximize performance gain per unit of QoR loss
● Self-healing Algorithms
○ some algorithms have less dependency on global sync
○ e.g. in Stochastic Optimization
■ global sync may be postponed to allow local optima to be explored
■ however, this nice property is data- and model-dependent
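A toy sketch of postponed sync in stochastic optimization (the objective and all constants here are made up for illustration): workers take local noisy gradient steps and only average their models periodically.

    import random

    def noisy_grad(w):               # hypothetical gradient of (w - 3)^2 + noise
        return 2 * (w - 3) + random.gauss(0, 0.1)

    def train(n_workers=4, steps=100, sync_every=10, lr=0.05):
        ws = [random.uniform(-1, 1) for _ in range(n_workers)]
        for t in range(1, steps + 1):
            ws = [w - lr * noisy_grad(w) for w in ws]   # purely local progress
            if t % sync_every == 0:                     # postponed global sync
                avg = sum(ws) / len(ws)
                ws = [avg] * len(ws)
        return sum(ws) / len(ws)

    print(train())   # lands near 3 despite syncing only every 10 steps

How far the sync can be postponed without hurting QoR is exactly the data- and model-dependent question raised above.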
Distributed Approximate Algorithms 2)
● Major challenges 1)
○ Trade-off on QoR?
■ approximation is inevitable, so what can be approximated?
■ not just an engineering problem
■ usually needs assessment on business impact
○ Solutions
■ for each approximation candidate, detailed profiling of QoR loss vs. performance gain vs. business impact
Distributed Approximate Algorithms 3)
● Major challenges 2)
○ Hard to maintain?
■ Stochastic algorithms: determinism can only be found in probability values, not exact results
■ Graph algorithms: hard to trace in large-scale graph
○ Solutions
■ develop the single-machine algorithm first as the golden reference
■ detailed testing and correlation for each parallelization step
■ detailed testing to understand result/error pattern on small data
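A minimal example of correlating a parallelized step against the single-machine golden (the tolerance is an illustrative choice, since floating-point reordering changes low-order bits):

    # Correlate the distributed step against the trusted golden reference.
    def golden(xs):                      # single-machine reference
        return sum(x * x for x in xs)

    def distributed(xs, n_parts=4):      # the parallelization step under test
        parts = [xs[i::n_parts] for i in range(n_parts)]
        return sum(sum(x * x for x in p) for p in parts)

    xs = [0.1 * i for i in range(1000)]
    g, d = golden(xs), distributed(xs)
    assert abs(g - d) <= 1e-9 * max(1.0, abs(g)), (g, d)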
Distributed Iterative Algorithms 1)
● Many algorithms for large-scale problem are iterative
○ Simulated Annealing; Genetic Algorithm; Graph Partition; PageRank;
Expectation Maximization; Loopy Belief Propagation etc
● Two Common approaches
○ Local computation + lazy Sync
○ Global computation with graph propagation
Distributed Iterative Algorithms 2)
● Distributed environment adds another layer of complexity
○ iterations need to be tuned, or completely re-designed
○ may become harder to converge
● Tuning iterations
○ Again, where to iterate?
■ spend runtime on the key gainers
■ profiling of iterations VS. QoR gain
○ Tuning knobs for convergence
■ iteration knobs have a very high impact on convergence
■ profiling of convergence parameters VS runtime VS QoR
Multi-stage Distributed Flow
● Data re-partition problem (“Shuffle” in Spark Language)
“In these distributed computation engines, the shuffle refers to the
repartitioning and aggregation of data during an all-to-all operation.
Understandably, most performance, scalability, and reliability issues that we
observe in production Spark deployments occur within the shuffle.”
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double
Multi-stage Distributed Flow 1)
● Data re-partition problem (“Shuffle” in Spark Language)
○ unified partition VS. per-stage partition
■ per-stage partition fits algorithm better, but requires data
migration
○ global partition VS. stream partition
■ global partition fits algorithm better, but requires single machine to
hold all data for partition
■ stream partition + post-partition adjustment
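A minimal sketch of stream partitioning with a post-partition adjustment (the rebalancing rule is an illustrative one): no machine ever needs to hold the whole dataset.

    # Stream partition: route records by hash as they arrive, one pass.
    def stream_partition(records, n_parts):
        parts = [[] for _ in range(n_parts)]
        for r in records:
            parts[hash(r) % n_parts].append(r)
        return parts

    # Post-partition adjustment: shave the worst skew after the fact.
    def rebalance(parts):
        biggest = max(parts, key=len)
        smallest = min(parts, key=len)
        while len(biggest) - len(smallest) > 1:
            smallest.append(biggest.pop())
        return parts

    parts = rebalance(stream_partition(range(1000), 4))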
Multi-stage Distributed Flow 2)
● Data re-partition problem (“Shuffle” in Spark Language)
○ QoR numerical dependence on the number of partitions
■ direct partitioning has numerical stability problem
■ fine-grained partition + post-partition coarsening is better
● Solutions
○ Hard to use standard libraries for a high-performance system
○ The best-performing systems are customized for:
■ Data volume
■ Computation intensity
■ (Multiple-stage) Algorithm parallelism
○ Always keep a golden single-machine run, even for small input data!
Smart QA
cannot fix a bug unless you can reproduce it
cannot build a system unless you can test it
...
Smart QA: The Guardian
Smart QA: Why
● Successful software must have good QA
○ A high level model of the system
○ Save time in debug
○ Save business in crisis
● Throughout Software Lifecycle
○ Development: test-driven development
○ Deployment: handles discrepancy b/w user env and dev env
○ Maintenance: predicts error, learns from failures, improves system
Protection Code
● Assert / Try, Except / Raise…
● Good to have:
○ Documents which cases the code runs through
○ Surfaces information on internal data when needed
● Too much of it?
○ hurts performance
● Need a balance
○ External data at input → full sanity check
○ Internal data → no checks inside the high-performance engine; system design
and code quality should ensure correctness instead (see the sketch below)
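A small sketch of that balance: full sanity checks at the external boundary, while internal checks use assert, which Python strips under the -O flag so the production engine pays nothing.

    # Boundary: external data gets a full sanity check, exactly once.
    def load_external(records):
        cleaned = []
        for r in records:
            if not isinstance(r, (int, float)):
                raise ValueError("non-numeric record: %r" % (r,))
            cleaned.append(float(r))
        return cleaned

    # Interior: trusted data; this assert disappears under `python -O`.
    def inner_kernel(xs):
        assert all(x == x for x in xs)   # debug-only NaN check (NaN != NaN)
        return sum(xs) / len(xs)

    print(inner_kernel(load_external([1, 2, 3])))   # 2.0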
Auditing Code
● Check correctness from another angle
○ Rule based
■ Simply add up the numbers to see if they match
■ Use another algorithm, simpler, but does rough check
○ Data driven
■ Sample intermediate data from normal runs; issue an alert when the
runtime data distribution differs
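Minimal sketches of both auditing styles (the threshold values are illustrative assumptions):

    import statistics

    # Rule based: simply add up the numbers and see if they match.
    def audit_totals(partial_sums, grand_total, tol=1e-6):
        return abs(sum(partial_sums) - grand_total) <= tol

    # Data driven: compare a runtime sample against the distribution
    # recorded from known-good runs, and alert on drift.
    def audit_distribution(sample, baseline_mean, baseline_std, n_sigma=3):
        return abs(statistics.mean(sample) - baseline_mean) <= n_sigma * baseline_std

    if not audit_totals([1.0, 2.0, 3.0], 6.0):
        print("audit alert: partition totals do not match the grand total")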
Debug Code
● As important as functional code! (if not more)
● Essentially a high level abstraction on code OUTPUT
○ Not just debug
○ A reversed tree structure, with samples on key nodes
○ Grows intelligently with field practice
● Maintenance effort should decrease over time
○ The error handling/messaging system should mature over time
○ Bugs should be fixed in the right direction, not just worked around