The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL & MICROSOFT RESEARCH
GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Data Management
Naga K. Govindaraju, Jim Gray, Ritesh Kumar, Dinesh Manocha
http://gamma.cs.unc.edu/GPUTERASORT
2
Sorting
“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth
3
Sorting
Well studied
• High performance computing
• Databases
• Computer graphics
• Programming languages
• ...
Google MapReduce algorithm
SPEC benchmark routine!
4
Massive Databases
Terabyte data sets are common
• Google sorts more than 100 billion terms in its index
• > 1 trillion records in web indexed!
Database sizes are rapidly increasing!
• Max DB size increases 3x per year (http://www.wintercorp.com)
• Processor improvements not matching information explosion
5
CPU vs. GPU
[Diagram: system components and their connections]
CPU (3 GHz)
2 x 1 MB Cache
System Memory (2 GB)
AGP Memory (512 MB)
PCI-E Bus (4 GB/s)
GPU (690 MHz)
Video Memory (512 MB)
6
External Memory Sorting
Performed on Terabyte-scale databases
Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
• Limited main memory
• First phase – partitions the input file into large data chunks and writes sorted chunks known as “runs”
• Second phase – merges the “runs” to generate the sorted file
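The two phases above can be sketched on the CPU as follows. This is a minimal illustration, not the GPUTeraSort implementation: records are assumed to be newline-free strings, `run_size` (a record count standing in for the run size) and the helper `_write_run` are hypothetical names, and the merged result is returned as a list rather than streamed to disk.

```python
import heapq
import os
import tempfile


def _write_run(sorted_chunk):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        for rec in sorted_chunk:
            f.write(rec + "\n")
    return path


def external_sort(records, run_size):
    """Two-phase external sort over an iterable of string records."""
    run_paths = []

    # Phase 1: cut the input into chunks, sort each chunk in memory,
    # and write it out as a sorted run.
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == run_size:
            run_paths.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        run_paths.append(_write_run(sorted(chunk)))

    # Phase 2: read all runs sequentially and merge them (k-way merge).
    files = [open(p) for p in run_paths]
    try:
        streams = [(line.rstrip("\n") for line in f) for f in files]
        return list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

For example, `external_sort(["pear", "apple", "fig", "kiwi", "date"], 2)` writes three runs and merges them into one sorted list.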
7
External Memory Sorting
Performance mainly governed by I/O
Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)
11
Salzberg Analysis
If N=100GB, T=2MB, then R ≈ 230MB
Large data sorting on CPUs can achieve high I/O performance by sorting large runs
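A brief justification of the rule, under the simplifying assumption that memory is the only constraint: phase 1 needs room for one run (M ≈ R), and phase 2 needs one read buffer of size T for each of the N/R runs (M ≈ T·N/R). Equating the two gives R² ≈ TN. The sketch below simply evaluates that expression; the function name and the use of decimal units are illustrative assumptions. Evaluated this way, N = 100 GB and T = 2 MB give roughly 447 MB, so the ≈ 230 MB figure quoted on the slide presumably reflects additional constants that the transcript does not show.

```python
import math


def salzberg_run_size(file_size_bytes, read_size_bytes):
    """Run size R ≈ sqrt(T * N) from the rule quoted on the slide."""
    return math.sqrt(read_size_bytes * file_size_bytes)


N = 100 * 10**9  # 100 GB file (decimal units are an assumption)
T = 2 * 10**6    # 2 MB read size per run in phase 2

R = salzberg_run_size(N, T)
num_runs = N / R  # number of runs produced by phase 1

print(f"R = {R / 1e6:.0f} MB, runs = {num_runs:.0f}, "
      f"phase-2 buffer memory = {num_runs * T / 1e6:.0f} MB")
```

At this run size the phase-2 buffer memory equals the run size itself, which is the balance point the rule expresses.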
12
Massive Data Handling on CPUs
Require random memory accesses
• Small CPU caches (< 2 MB)
• Slower than even sequential disk accesses – bottleneck shifts from I/O to memory
• Widening memory-to-compute gap!
External memory sorting on CPUs can have low performance due to
• High memory latency on account of cache misses
• Or low I/O performance
Sorting is hard!
13
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively parallel vector processors
High memory bandwidth
Low memory latency pipeline
Programmable
High growth rate
14
GPU: Commodity Processor
Cell phones, laptops, consoles, PSP, desktops
15
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively parallel vector processors
• 10x more operations per sec than CPUs
High memory bandwidth
Low memory latency pipeline
Programmable
High growth rate
16
Parallelism on GPUs
Graphics FLOPS
GPU – 1.3 TFLOPS
CPU – 25.6 GFLOPS
17
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively parallel vector processors
High memory bandwidth
• 10x more memory bandwidth than CPUs
Better hides memory latency
Programmable
High growth rate
18
Graphics Pipeline
[Diagram: pipeline stages, each with access to memory]
vertex: programmable vertex processing (fp32)
polygon (setup, rasterizer): polygon setup, culling, rasterization
pixel: programmable per-pixel math (fp32)
texture: per-pixel texture, fp16 blending
image: Z-buf, fp16 blending, anti-alias (MRT)
memory
Hides memory latency!!
Low pipeline depth
56 GB/s
19
NON-Graphics Pipeline Abstraction
[Diagram: the same pipeline stages relabeled for general-purpose data processing]
data: programmable MIMD processing (fp32)
lists (setup, rasterizer): SIMD “rasterization”
data: programmable SIMD processing (fp32)
data: data fetch, fp16 blending
data: predicated write, fp16 blend, multiple output
memory
Courtesy: David Kirk, Chief Scientist, NVIDIA
20
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively parallel vector processors
High memory bandwidth
Low memory latency pipeline
Programmable
High growth rate
21
Technology Trends: CPU and GPU
[Chart: log of relative processing power vs. year, 2002 to 2008]
CPU curve: Moore’s Law trajectory; Cooling (Cost) Limitations; Corporate DT SW Requirements; segments Value, Mainstream Desktop, Mobile, Leading Edge, DT ‘Replacement’, Enthusiast / Specialty
GPU curve: Moore’s Law 3 for 18 mo, then Moore’s Law trajectory; Graphics Req’mts (enhanced experience); segments Value / UMA, Leading Edge
22
Architecture of Phase 1: GPUTeraSort
23
GPUs for Sorting: Issues
No support for arbitrary writes
• Optimized CPU algorithms do not map!
• Requires new algorithms – sorting networks
Lack of support for general data types
Out-of-core algorithms
• Limited GPU memory
Difficult to program
24
General Sorting on GPUs
Sorting networks: No data dependencies
Utilize high parallelism on GPUs
To handle large keys, use bitonic radix sort
• Perform bitonic sort on the 4 most significant bytes (MSB) using GPUs, compute sorted records with equal 4 MSBs, proceed to the next 4 bytes on those, and so on
• Can handle keys of any length
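The key-chunking idea can be sketched on the CPU as below. This is only an illustration of the control flow: Python's built-in `sorted` stands in for the GPU bitonic sort on one 4-byte chunk, and the function name and parameters are hypothetical.

```python
def sort_wide_keys(keys, offset=0, chunk=4):
    """Sort byte-string keys by comparing `chunk` bytes at a time.

    Each pass sorts on one 4-byte chunk only; groups of records that tie
    on that chunk are sorted again on the next chunk, and so on.
    """
    if len(keys) <= 1 or all(len(k) <= offset for k in keys):
        return list(keys)

    # Sort on the current chunk (stand-in for the GPU bitonic sort).
    ordered = sorted(keys, key=lambda k: k[offset:offset + chunk])

    result = []
    i = 0
    while i < len(ordered):
        # Find the group of records whose current chunk is equal.
        head = ordered[i][offset:offset + chunk]
        j = i + 1
        while j < len(ordered) and ordered[j][offset:offset + chunk] == head:
            j += 1
        group = ordered[i:j]
        # Only groups with ties need to look at the next chunk.
        if len(group) > 1:
            group = sort_wide_keys(group, offset + chunk, chunk)
        result.extend(group)
        i = j
    return result
```

With random keys most ties are resolved by the first chunk, so later passes touch only small groups.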
25
GPU-Based Sorting Networks
Represent data as 2D arrays
Multi-stage algorithm
• Each stage involves multiple steps
In each step
1. Compare one array element against exactly one other element at fixed distance
2. Perform a conditional assignment (MIN or MAX) at each element location
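The stage and step structure described above can be illustrated with a plain CPU version of a bitonic sorting network. It is a sketch of the general technique on a 1D list, not the GPU code from the talk; the variable names are illustrative, and the input length is assumed to be a power of two.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    assert n & (n - 1) == 0, "length must be a power of two"

    k = 2
    while k <= n:          # stage: merge bitonic sequences of length k
        j = k // 2
        while j >= 1:      # step: compare elements at fixed distance j
            out = data[:]  # each output location is computed independently
            for i in range(n):
                partner = i ^ j
                ascending = (i & k) == 0
                # The lower index keeps MIN in ascending blocks and MAX in
                # descending blocks; the higher index does the opposite.
                keep_min = (i < partner) == ascending
                pair = (data[i], data[partner])
                out[i] = min(pair) if keep_min else max(pair)
            data = out
            j //= 2
        k *= 2
    return data
```

Within one step no output depends on another output of the same step, which is why every element location can be processed in parallel.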
26
Flash animation removed to save space (46 MB!)
27
2D Memory Addressing
GPUs optimized for 2D representations
Map 1D arrays to 2D arrays
• Minimum and maximum regions mapped to row-aligned or column-aligned quads
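One way to see why the comparison regions become row-aligned or column-aligned is to look at where an element's partner lands in 2D. The sketch below assumes a row-major layout with a power-of-two width and a power-of-two comparison distance; both helper names are hypothetical, and the talk's actual mapping may differ.

```python
def to_2d(index, width):
    """Row-major position (row, column) of a 1D index."""
    return divmod(index, width)


def partner_offset(distance, width):
    """2D displacement from an element to its partner at index + distance.

    Assumes width and distance are powers of two and that the element is
    the lower one of its pair (its `distance` bit is clear).
    """
    if distance < width:
        return (0, distance)        # same row, `distance` columns away
    return (distance // width, 0)   # same column, whole rows away
```

Because the displacement is the same for every element in a step, the elements that take MIN and those that take MAX form rectangular blocks rather than scattered locations.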
29
1D – 2D Mapping
MIN
Effectively reduce instructions per element
30
Sorting on GPU: Pipelining and Parallelism
Input Vertices
Texturing, Caching and 2D Quad Comparisons
Sequential Writes
31
Comparison with GPU-Based Algorithms
3-6x faster than prior GPU-based algorithms!
32
GPU vs. High-End Multi-Core CPUs
2-2.5x faster than Intel high-end processors
Single GPU performance comparable to high-end dual core Athlon
Hand-optimized CPU code from Intel Corporation!
33
Super-Moore’s Law Growth
50 GB/s on a single GPU
Peak Performance: Effectively hide memory latency with 15 GOP/s
Download URL: http://gamma.cs.unc.edu/GPUSORT
Slashdot and Tom’s Hardware news headlines
34
Implementation & Results
• Pentium IV PC ($170)
• NVIDIA 7800 GT ($270)
• 2 GB RAM ($152)
• 9 x 80 GB SATA disks ($477)
• SuperMicro Motherboard & SATA Controller ($325)
• Windows XP
PC costs $1469
35
Implementation & Results
Indy Sort Benchmark
• 10-byte random string keys
• 100-byte long records
• Sort maximum amount in 644 seconds
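The 644-second limit is consistent with the $1469 system price if one assumes a PennySort-style rule, in which hardware cost is depreciated over three years and the sort may use one cent's worth of machine time. The transcript names only the Indy category, so the rule, the function name, and the default parameters below are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def penny_time_budget(system_cost_dollars, years=3, budget_dollars=0.01):
    """Seconds of machine time that `budget_dollars` buys.

    Assumes the system cost is spread evenly over `years` of operation.
    """
    lifetime_seconds = years * SECONDS_PER_YEAR
    return budget_dollars * lifetime_seconds / system_cost_dollars


budget = penny_time_budget(1469)
print(f"time budget: {budget:.0f} s")
```

Under the same assumption, a cheaper system would be allowed proportionally more time, which is why price matters as much as raw speed in this category.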
36
Overall Performance
Faster and more scalable than Dual Xeon processors (3.6 GHz)!
37
Performance/$
1.8x faster than current Terabyte sorter
World’s best price-to-performance system
http://research.microsoft.com/barc/SortBenchmark
38
Analysis: I/O Performance
Salzberg Analysis: 100 MB Run Size
[Chart axis: peak sequential throughput in MB/s]
39
Analysis: I/O Performance
Pentium IV: 25MB Run Size (to reduce memory latency)
Less work and only 75% IO efficient!
Salzberg Analysis: 100 MB Run Size
40
Analysis: I/O Performance
Dual 3.6 GHz Xeons: 25MB Run size (to reduce memory latency)
More cores, less work but only 85% IO efficient!
Salzberg Analysis: 100 MB Run Size
41
Analysis: I/O Performance
7800 GT: 100MB run size
Ideal work, and 92% IO efficient with single CPU!
Salzberg Analysis: 100 MB Run Size
42
Task Parallelism
Performance limited by IO and memory
Sorting 100MB on GPU
Reorder or Sequential IO
Sorting 100MB on GPU: 3x > reorder or sequential IO
43
Why GPU-like Architectures for Large Data Management?
Plateau: Data Management Performance Crisis
GPU
44
Advantages
Exploit high memory bandwidth on GPUs
Higher memory performance than CPU-based algorithms
High I/O performance due to large run sizes
45
Advantages
Offload work from CPUs
• CPU cycles well-utilized for resource management
Scalable solution for large databases
Best performance/price solution for terabyte sorting
46
Limitations
May not work well on variable-sized keys and almost sorted databases
Requires programmable GPUs (GPUs manufactured after 2003)
47
Conclusions
Designed new sorting algorithms on GPUs
Handles wide keys and long records
Achieves 10x higher memory performance
Memory-efficient sorting algorithm with peak memory performance of 50 GB/s on GPUs
15 GOP/s on a single GPU
48
Conclusions
Novel external memory sorting algorithm as a scalable solution
Achieves peak I/O performance on CPUs
Best performance/price solution – world’s fastest sorting system
High performance growth rate characteristics
Improve 2-3 times/yr
49
Future Work
Designed high performance/price solutions
High wattage and cooling requirements of CPUs and GPUs
To exploit GPUs, we need easy-to-use programming APIs
Promising directions: BrookGPU, Microsoft Accelerator, Sh, etc.
Scientific libraries utilizing high parallelism and memory bandwidth
• Scientific routines on LU, QR, SVD, FFT, etc.
• BLAS library on GPUs
• Eventually, build GPU-LAPACK and Matlab routines
50
GPUFFTW
N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006 (to appear)
Download URL: http://gamma.cs.unc.edu/GPUFFTW
4x faster than IMKL on high-end Quad cores
SlashDot Headlines, May 2006
51
GPU Roadmap
GPUs are becoming more general purpose
Fewer limitations in Microsoft DirectX10 API
• Better and consistent floating point support
• Integer instruction support
• More programmable stages, etc.
Significant advance in performance
GPUs are being widely adopted in commercial applications
E.g., Microsoft Vista
52
Call to Action
Don’t put all your eggs in the multi-core basket
If you want TeraOps – go where they are
If you want memory bandwidth – go where the memory bandwidth is
CPU-GPU gap is widening
Microsoft Xbox is ½ TeraOP today
40 gops
40 gBps
53
Acknowledgements
Research Sponsors:
• Army Research Office
• Defense Advanced Research Projects Agency
• National Science Foundation
• Naval Research Laboratory
• Intel Corporation
• Microsoft Corporation
• NVIDIA Corporation
• RDECOM
Craig Peeper, Peter-Pike Sloan, David Blythe, Jingren Zhou
54
Acknowledgements
David Tuft (UNC)
UNC Systems, GAMMA and Walkthrough groups
55
Thank You
Questions or Comments?
{naga,ritesh,dm}@cs.unc.edu
http://www.cs.unc.edu/~naga
http://research.microsoft.com/~Gray