wade shen - eri summit2018/07/25  · • target o(seconds) for 100+tb kernels • support multiple...

29
WADE SHEN DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited. PROGRAM MANAGER DARPA/MTO

Upload: others

Post on 25-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

WADESHEN

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

PROGRAM MANAGERDARPA/MTO

Page 2: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

HIERARCHICAL IDENTIFY VERIFY EXPLOIT (HIVE)

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 3: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

THE HIVE PROGRAM

Program Objective: A graph processing stack yielding a 1,000x increase in computational efficiency over GPU solutions for DoD graph analytics.

Static and streaming analytics for trillion edge graphs

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 4: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

MANY DOD PROBLEMS ARE GRAPH PROBLEMS

Intelligence Graph algorithm

Geolocation inference Label PropagationPersona de-aliasing Stochastic Graph MatchingTarget prioritization Personalized PageRankSeeded target discovery Vertex NominationOrganization discovery Local Community DetectionDetection of money laundering Query by ExampleLeadership detection Role prediction

Operations Graph algorithm

Target audience discovery Snowball SamplingNetwork mapping Community DetectionNetwork infrastructure discovery Graph ProjectionCyber attack detection Anomaly Detection

Support Graph algorithm

Logistics/routeplan

Hierarchical Hub Labeling

HR selection K-nearest neighborsOps planning Trellis search

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 5: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

GRAPH VS. NUMERIC WORKLOADS

• The graph processing fallacy• GPUs/supercomputers designed for matrix math• All graphs = sparse matrices• ∴ GPUs/supercomputers process graphs well

Scale poorly for data movement problems

CPU/GPUs use caches to optimize locality

Miss Hit

Graph problems driven by “random accesses”

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

GPUs and CPUs are poorly suited for graph processing

Page 6: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

PROGRAM STRUCTURE

GPU benchmark team

1000x

Hardware/Software

HW – TA1 SW - TA2Co-design

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Qualcomm Intel

Georgia TechUniversity of California - Davis

Pacific Northwest Nat’l Lab

Georgia Tech

Page 7: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 8: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

JOSHUA

SENIOR PRINCIPAL ENGINEER, PHD

FRYMAN

INTEL DCG, DARPA HIVE PI

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 9: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

JOSHUA FRYMAN, PHD

DARPA HIVE PISENIOR PRINCIPAL ENGINEERINTEL DCG

INTEL’S HIVE:FUTURE GRAPH ANALYTICS

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 10: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

WHAT’S SO HARD ABOUT GRAPHS . . . ?

Behavior Dense Compute Graph Compute

Compute intensity / arithmeticproperties

Lots of computationally intensive math ops; some data massaging that’s spatio-temporal friendly

Not computationally (math) intensive; mostly scheduling memory accesses and control flow problems

Cacheline access behavior Worker threads use ~95% of their full cacheline; control threads are complicated

~50% of cachelines evicted with ≤ 16B used and ~75% with ≤ 32B used

Control flow inter-arrival behavior

Workers have long runs between branches; control threads are sufficiently predictable

~80% of branches occur inside dependent memory chains; extreme stress on pipelines and structures

Memory flow inter-arrival behavior

~65% of memory references back-to-back; excellent locality effectiveness

~65% of memory references back-to-back; nested dependent pointer chains

Page 11: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

WHAT’S SO HARD ABOUT GRAPHS . . . ?

0%10%20%30%40%50%60%70%80%90%

100%

0 8 16 24 32 40 48 56 64Bytes Touched (R, W, AMO)

Per Instance-Lifetime CL Utilization (Any Residency: L1, L2, LL)

SpMV - RGraph500 - RTriangle Counting - RBFS - Rk-truss - RLouvain - CNerstrand - C

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 12: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

0.8

1.0

1.2

1.4

1.6

1.8

2.0

SpecInt2017 APEX DB/CSP ML/AI Graph

Workload Sensitivity

1GB Pages Zero TLB Perfect L2 Perfect L1 2x Mesh BW

4x Mesh BW 2x DRAM BW 4x DRAM BW 0.5x DRAM Lat

WHERE DOES “BUSINESS AS USUAL” TAKE US . . . ?

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 13: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

BACK TO THE BASICS – SYSTEM CO-DESIGNXeon Phi – k-Truss Graph Algorithm

Time

Frac

tion

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 14: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

BACK TO THE BASICS – SYSTEM CO-DESIGN

Disk Pool

Storage Nodes In-Memory “Raw” Pre-processing “Cooked” Compute Platform

Question: What’s the right problem to optimize here? [The Laws of {Amdahl, Gustafson, Little} Are Not Forgiving]

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 15: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

BUILDING A GRAPH ANALYTICS SYSTEM

• Intel is developing a HIVE solution• 1,000x Perf/W gain target on 100+TB

• Locality will be problematic• Divide and Conquer has imbalance• Dynamic graphs warp partitions

• Focus on a scalable platform at all levels• Memory, Network, and Compute• Target O(seconds) for 100+TB kernels

• Support multiple representations equally• Sparse matrix operations & GraphBLAS• Meta-data laden graph abstractions

• Opportunities to engage and partnerCopyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 16: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

IMPACT AND OPPORTUNITIES

• Open-source Graph primitives and tools• Actively seeking workloads and datasets• Co-design targets with customer input

LLVM + libC (Phase 1) libStdC++ (Phase 2)

High-Level API & Runtime

Bare Metal Runtime

Intel SW API & Runtime

Intel Hardware

(Legacy)

Customer Workloads

Tools (gdb, etc.)

Frameworks (Caffe, Tensorflow, Neon nGraph, etc.)

Intel MKL, GraphBLAS, etc.LLVM

APIs

Applications

Raw

Per

form

ance

Ease

of U

se

Intel Graph Analytics Processor

Hardware Low-level APIs

Native ISA

Copyright © 2018 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, and others are trademarks of Intel Corporation in the U.S. and/or other countries. Distribution Statement A – Approved for Public Release, Distribution Unlimited

Page 17: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 18: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

SHEKAR

SENIOR DIRECTOR OF TECHNOLOGY

BORKAR

QUALCOMM

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 19: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

HONEYCOMBA GRAPH ANALYTICS PROCESSOR WITH HIGH EFFICIENCY

SHEKHAR BORKAR (PI)MATT RADECIC (PM)QUALCOMM INTELLIGENT SOLUTIONS, INC.JULY 2018

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA).The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

Page 20: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

PROGRAM GOALS

Performance 10 GTEPs / Node

Energy Efficiency 0.5 GTEPs / W2 nJ / TE

Processing Efficiency 100x in Hardware10x in Software 1000x Total

Memory Efficiency 90% both, random & sequential accesses

Demonstration 16 Nodes160 GTEPs system

Scalability Beyond 16 nodes to Tera TEPs

SCALABLE 160 GTEPS SYSTEM, CONSUMING < 320 WATTS

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 21: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

HIVE GOALS COMPARED TO GRAPH-500 (Q4-2016)

GOALS ARE EVEN HARDER CONSIDERING GRAPH-500 IS PROBABLY NOT A GOOD REPRESENTATIVEDISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 22: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

CHALLENGES: 160 GTEPS @ 320 W

INVESTIGATION PRIORITIES: (1) INTRA-NODE DATA MOVEMENT, (2) INTERCONNECT, (3) COMPUTE

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 23: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

SYSTEM ARCHITECTURE

0.5X performance@

100X lower power

Page 24: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

SENSITIVITY TO PROCESSOR FREQUENCY

WORKLOADS ARE NOT VERY SENSITIVE TO PROCESSOR FREQUENCYDISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 25: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

SENSITIVITY TO DATA-MOVEMENT PERFORMANCE

Saturation?Cliff?

WORKLOADS ARE MORE SENSITIVE TO DATA-MOVEMENT PERFORMANCE

RelativePerformance

Peak Byte/Op

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 26: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

MULTI-THREADED WORKLOAD BEHAVIOR (NODE)

Single thread:Minimal performance change with BWBW overprovisioned by the platform

12 threads:Performance improves with BW

Same behavior with 50% higher latency

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 27: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

SIMULATED PERFORMANCE, ENERGY, POWER

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 28: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

SUMMARY

SYSTEM DESIGN MUST BE OPTIMIZED FOR DATA MOVEMENT!

• DARPA-hard goals, yet achievable!

• Simulation based workload analysis shows data movement dominates• Not much by compute

Therefore…

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.

Page 29: WADE SHEN - ERI Summit2018/07/25  · • Target O(seconds) for 100+TB kernels • Support multiple representations equally • Sparse matrix operations & GraphBLAS • Meta-data laden

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited.