energy efficient architecture for graph analytics...
TRANSCRIPT
![Page 1: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/1.jpg)
Energy Efficient Architecture for Graph Analytics Accelerators
ISCA’16
Mustafa Ozdal*, Serif Yesil*, Taemin Kim†, Andrey Ayupov†, John Greth†, Steven M. Burns†, Ozcan Ozturk*
* Bilkent University, Ankara, Turkey† Intel Corporation, Oregon, USA
![Page 2: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/2.jpg)
Motivation Dark silicon era
Accelerator rich architectures: Customized hardware for specific applications
Hardware design is complex and time consuming
Many applications. Which ones to accelerate? Months of design effort.
Template based design: Capture commonalities for a domain
2
![Page 3: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/3.jpg)
Graph Analytics
Model relationships between individual entities
Emerging application areas:Social networks, web, recommender systems, …
Example applications: PageRank, Collaborative Filtering, Loopy Belief Propagation, Betweenness Centrality, …
Graph-level parallelism & iterative algorithms
3
from Wikimedia
![Page 4: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/4.jpg)
Graph Accelerator TemplateTargeted Graph Computation Pattern: Vertex-centric & Gather - Apply - Scatter (GAS)
We propose: Energy efficient accelerator architecture for irregular graph applications
Well-defined template to plug in different applications
Synthesizable SystemC models for architecture exploration & hardware generation
Design Productivity & Efficiency: Template code size : 39K lines, user code size 43 lines for PageRank
PageRank: 65X better power efficiency than 24 cores of Xeon CPU
4
![Page 5: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/5.jpg)
Outline
Targeted Application Characteristics
Graph-Parallel Abstraction
Proposed Architecture
Experimental Results
5
![Page 6: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/6.jpg)
Graph Analytics
Different than traditional HPC Irregular data access & communication
Poor cache locality
Computation-to-communication ratio very low
Irregular topologies due to scale-free graphs
Convergent algorithms Throughput vs. work-efficiency
Different implementation choices
High throughput easier to achieve than work efficiency
6
![Page 7: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/7.jpg)
Asymmetric Convergence
Processing all vertices in every iteration is not work-efficient!
7
PageRank Execution
7% converge in 1 iteration
51% converge in 36 iterations
99.7% converge in 50 iterations
100% converge in 77 iterations
Similar observation was made in: Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J.M. Hellerstein, “Distributed Graphlab: A framework for machine learning and data mining in the cloud,” In Proc. of VLDB Endow., vol. 5, pp. 716-727, 2012
about 2x more edges processed for PageRank!
![Page 8: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/8.jpg)
Synchronous vs. Asynchronous Execution
8
Jacobi iteration formula for PageRank:
𝑟𝑘+1 𝑣 =1 − 𝛼
𝑁+ 𝛼
(𝑢→𝑣)
𝑟𝑘(𝑢)
𝑑𝑒𝑔𝑟𝑒𝑒(𝑢)
Synchronous: All vertices are updated simultaneously.
Gauss-Seidel iteration formula for PageRank:
𝑟𝑘+1 𝑣 = 1 − 𝛼 + 𝛼σ 𝑢<𝑣(𝑢→𝑣)
𝑟𝑘+1(𝑢)
𝑑𝑒𝑔𝑟𝑒𝑒(𝑢)+ 𝛼σ 𝑢>𝑣
(𝑢→𝑣)
𝑟𝑘(𝑢)
𝑑𝑒𝑔𝑟𝑒𝑒(𝑢)
Asynchronous: Updates to a vertex are visible to others in the same iteration.Observed to be much faster to converge! (30-50% less work)
![Page 9: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/9.jpg)
Throughput vs. Work Efficiency
9
Process all vertices
Easier to implement
High throughput
Worse work efficiency
Process active vertices only
Maintain worklist, dynamic work assignment
Lower throughput
Better work efficiency
Asymmetric Convergence
Synchronous
Easier to implement
High throughput
Worse work efficiency
Asynchronous
Fine-grain synchronization, sequential consistency support
Lower throughput
Better work efficiency
Iterative Execution Model
Ozdal, et. al. ICCAD 2015
![Page 10: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/10.jpg)
Outline
Targeted Application Characteristics
Graph-Parallel Abstraction
Proposed Architecture
Experimental Results
10
![Page 11: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/11.jpg)
Gather-Apply-Scatter Abstraction Abstraction proposed by Graphlab for distributed computing (Low, et. al. VLDB 2012)
Data structures associated with each vertex and edge
Compute operations defined for 3 stages of a vertex program:
11
GATHER APPLY SCATTER
![Page 12: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/12.jpg)
Outline
Targeted Application Characteristics
Graph-Parallel Abstraction
Proposed Architecture
Experimental Results
12
![Page 13: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/13.jpg)
13
ACCELERATOR UNIT
Active List Mgr: Maintains active vertices
Runtime: Schedules vertex computation
Gather Unit: Accumulates data from neighbors for a vertex
Apply Unit: Performs main computation for a vertex using gather results
Scatter Unit: Distributes the new data to neighbors; activates neighbors
Memory modules: Customized per graph data type
![Page 14: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/14.jpg)
Compute UnitsGather Unit Neighbor vertices and edges accessed. Poor cache locality!
Latency tolerant: Tens of vertices and hundreds of edges processed concurrently. High MLP!
Storage for partial vertex and edge states with dynamic load balancing
Dependency between neighboring vertices handled through Sync Unit
Apply Unit Computation done on local data only
Scatter Unit Similar to Gather Unit
Memory writes in addition to reads
Neighbor vertex activations
14
![Page 15: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/15.jpg)
Control UnitsSync Unit Ensures race-free and sequentially-consistent execution of vertices
Maintains execution states of vertices and assigns a rank for each vertex
Guarantees the proper RAW and WAR ordering for neighboring vertices
High-throughput processing
Active List Manager Active vertices stored in main memory with efficient caching
High-throughput access mechanisms
Race-free simultaneous accessed without explicit locks
Coordinates with Sync Unit for asynchronous execution
15
![Page 16: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/16.jpg)
Multiple Accelerator Units
Banked design: Each unit responsible for a static subset of vertices
Two global light-weight modules: GTD: Global Termination Detector
GRC: Global Rank Counter
16
![Page 17: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/17.jpg)
Outline
Targeted Application Characteristics
Graph-Parallel Abstraction
Proposed Architecture
Experimental Results
17
![Page 18: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/18.jpg)
Benchmarks
Applications PageRank (PR)
Single Source Shortest Path (SSSP)
Stochastic Gradient Descent (SGD)
Loopy Belief Propagation (LBP)
Datasets PR & SSSP: 6 datasets from Snap and generated with Graph500 (up to 1B edges)
LBP: 3 images generated with GraphLab’s synthetic image generator (up to 18M edges)
SGD: 2 movie datasets from MovieLens (up to 10M edges)
18
![Page 19: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/19.jpg)
Experimental SetupBaseline CPU 2-socket 24-core IvyBridge Xeon with 30MB LLC and 132GB of main memory
Optimized software implementations in OpenMP/C++
Running Average Power Limit (RAPL) to estimate energy
Projected DDR3 power (measured) to DDR4 power (in-house DDR4 model)
Proposed Accelerator Performance: Cycle accurate SystemC model + DRAMSim2
Accelerator power and area: HLS + physical-aware logic synthesis with a 22nm industrial library
Cache power and area: CACTI models
DRAM power: in-house DDR4 model
19
![Page 20: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/20.jpg)
Performance Comparison
20
0
1
2
3
4
5
Accelerator Speed Up
24-cores 12-cores
![Page 21: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/21.jpg)
Power Comparison
Accelerator power is dominated by DRAM power. Improvements would be ~10x higher without DRAM power
21
0
10
20
30
40
50
60
70
CPU Power / ACC Power
24-cores 12-cores
![Page 22: Energy Efficient Architecture for Graph Analytics Acceleratorsisca2016.eecs.umich.edu/wp-content/uploads/2016/07/3A-2.pdf · 3/7/2016 · Energy Efficient Architecture for Graph](https://reader033.vdocument.in/reader033/viewer/2022050410/5f879717a98a67377f3e7cc9/html5/thumbnails/22.jpg)
Conclusions A template architecture for graph-analytics is proposed
Latency tolerance for irregular accesses
Graph-parallel execution with sequential consistency
Asynchronous execution and active vertex set support
Synthesizable and cycle-accurate SystemC models
Different accelerators generated by plugging in app-specific functions
Template code size : 39K lines, user code size 43 lines for PageRank
Experiments with 22nm industrial libraries: Performance comparable with a 24-core Xeon system (except SSSP)
Up to 65x less power
22