The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL & MICROSOFT RESEARCH

GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Data Management

Naga K. Govindaraju Jim Gray

Ritesh Kumar Dinesh Manocha

http://gamma.cs.unc.edu/GPUTERASORT

2


Sorting

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth

3


Sorting

Well studied
High performance computing
Databases
Computer graphics
Programming languages
...

Google MapReduce algorithm
SPEC benchmark routine!

4


Massive Databases

Terabyte data sets are common
Google sorts more than 100 billion terms in its index
> 1 trillion records in web indexed!

Database sizes are rapidly increasing!

Max DB size increases 3x per year (http://www.wintercorp.com)
Processor improvements not matching information explosion

5


CPU vs. GPU

CPU (3 GHz), 2 x 1 MB cache
System memory (2 GB)
AGP memory (512 MB)
PCI-E bus (4 GB/s)
Video memory (512 MB)
GPU (690 MHz)

6


External Memory Sorting

Performed on Terabyte-scale databases

Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]

Limited main memory
First phase: partition the input file into large data chunks and write the sorted chunks, known as “runs”
Second phase: merge the “runs” to generate the sorted file
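The two phases can be sketched in a few lines of Python. This is only an illustration of the general run-and-merge scheme, not the GPUTeraSort implementation: the names `external_sort` and `_write_run` are hypothetical, records are assumed to be byte strings without newlines, and `run_size` (a record count) stands in for the limited main memory.

```python
import heapq
import os
import tempfile

def _write_run(sorted_chunk):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for rec in sorted_chunk:
            f.write(rec + b"\n")
    return path

def external_sort(records, run_size):
    """Sort byte-string records while holding at most run_size in memory
    during phase 1. Returns a list only to keep the demo short; a real
    sorter would stream the merged output to disk."""
    run_paths = []
    # Phase 1: cut the input into chunks, sort each chunk, write it as a run.
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == run_size:
            run_paths.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        run_paths.append(_write_run(sorted(chunk)))

    # Phase 2: k-way merge of all runs, reading each run sequentially.
    files = [open(p, "rb") for p in run_paths]
    try:
        streams = [(line.rstrip(b"\n") for line in f) for f in files]
        return list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

In GPUTeraSort the in-memory sort of each chunk in phase 1 is the part handed to the GPU; here Python's `sorted` is a stand-in for it.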

7



Performance mainly governed by I/O

Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

[Diagram: file size N, run size R, I/O read size T]



11


Salzberg Analysis

If N=100GB, T=2MB, then R ≈ 230MB

Large data sorting on CPUs can achieve high I/O performance by sorting large runs
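A rough way to see why the run size matters: each run needs its own read buffer of size T during the merge, so larger runs mean fewer buffers. The sketch below assumes exactly one buffer of size T per run and decimal units (1 MB = 10^6 bytes); the function name `merge_phase_buffers` is hypothetical.

```python
MB = 10**6
GB = 10**9

def merge_phase_buffers(file_size, run_size, read_size):
    """Return (number of runs, bytes of read buffers needed in phase 2)."""
    runs = -(-file_size // run_size)  # ceiling division
    return runs, runs * read_size

runs, buffer_bytes = merge_phase_buffers(100 * GB, 230 * MB, 2 * MB)
print(runs, buffer_bytes // MB)  # prints: 435 870
```

Under these assumptions, a 100 GB file cut into 230 MB runs is merged from 435 runs with about 870 MB of read buffers.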

12


Massive Data Handling on CPUs

Require random memory accesses
Small CPU caches (< 2 MB)
Slower than even sequential disk accesses: bottleneck shifts from I/O to memory
Widening memory-to-compute gap!

External memory sorting on CPUs can have low performance due to

High memory latency on account of cache misses
Or low I/O performance

Sorting is hard!

13


Graphics Processing Units (GPUs)

Commodity processor for graphics applications
Massively parallel vector processors
High memory bandwidth

Low memory latency pipeline
Programmable

High growth rate

14


GPU: Commodity Processor

Cell phones
Laptops
Consoles
PSP
Desktops

15



Commodity processor for graphics applications
Massively parallel vector processors

10x more operations per sec than CPUs

High memory bandwidth
Low memory latency pipeline
Programmable

High growth rate

16


Parallelism on GPUs

Graphics FLOPS

GPU – 1.3 TFLOPS

CPU – 25.6 GFLOPS

17




Better hides memory latency
Programmable
10x more memory bandwidth than CPUs

High growth rate

18


Graphics Pipeline

vertex: programmable vertex processing (fp32)
polygon (setup, rasterizer): polygon setup, culling, rasterization
pixel: programmable per-pixel math (fp32)
texture: per-pixel texture, fp16 blending
image: Z-buf, fp16 blending, anti-alias (MRT)
memory
56 GB/s

Hides memory latency!!
Low pipeline depth

19


NON-Graphics Pipeline Abstraction

data: programmable MIMD processing (fp32)
lists (setup, rasterizer): SIMD “rasterization”
data: programmable SIMD processing (fp32)
data: data fetch, fp16 blending
data: predicated write, fp16 blend, multiple output
memory

Courtesy: David Kirk, Chief Scientist, NVIDIA

20




Low memory latency pipeline
Programmable

High growth rate

21


Technology Trends: CPU and GPU

[Chart: log of relative processing power, 2002-2008, for CPU and GPU. CPU: Moore’s Law trajectory (corporate DT SW requirements). GPU: “Moore’s Law 3 for 18 mo”, then Moore’s Law trajectory (graphics requirements, enhanced experience). Annotations: cooling (cost) limitations; market segments value, mobile, mainstream desktop, DT ‘replacement’, enthusiast / specialty, value / UMA, leading edge. Clock labels: 2.2 GHz, 4.4 GHz, 31 GHz, 0.8 GHz, 1.6 GHz.]

22


Architecture of Phase 1: GPUTeraSort

23


GPUs for Sorting: Issues

No support for arbitrary writes
Optimized CPU algorithms do not map!
Requires new algorithms: sorting networks

Lack of support for general data types
Out-of-core algorithms

Limited GPU memory

Difficult to program

24


General Sorting on GPUs

Sorting networks: No data dependencies

Utilize high parallelism on GPUs

To handle large keys, use bitonic radix sort

Perform a bitonic sort on the 4 most significant bytes (MSBs) using the GPU, identify the sorted records whose 4 MSBs are equal, then proceed to the next 4 bytes on those records, and so on
Can handle keys of any length
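The refinement on wide keys can be sketched as follows. This is a CPU-only illustration, assuming keys are byte strings; `sort_wide_keys` and `refine` are hypothetical names, and Python's `sorted` stands in for the GPU bitonic sort of each 4-byte slice.

```python
def sort_wide_keys(keys, chunk=4):
    """Sort byte-string keys by refining on `chunk` bytes at a time."""
    def refine(group, offset):
        # Stand-in for the GPU bitonic sort on one slice of the key.
        group = sorted(group, key=lambda k: k[offset:offset + chunk])
        out, i = [], 0
        while i < len(group):
            # Find the block of records whose current slice is equal.
            prefix = group[i][offset:offset + chunk]
            j = i
            while j < len(group) and group[j][offset:offset + chunk] == prefix:
                j += 1
            block = group[i:j]
            # Only tied blocks with key bytes left need another pass.
            if len(block) > 1 and any(len(k) > offset + chunk for k in block):
                block = refine(block, offset + chunk)
            out.extend(block)
            i = j
        return out
    return refine(list(keys), 0)
```

For the 10-byte keys used later in the benchmark, this takes at most three passes (4 + 4 + 2 bytes), and later passes only touch records that tied on all earlier bytes.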

25


GPU-Based Sorting Networks

Represent data as 2D arrays

Multi-stage algorithm
Each stage involves multiple steps

In each step
1. Compare one array element against exactly one other element at a fixed distance
2. Perform a conditional assignment (MIN or MAX) at each element location
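The stage/step structure above matches a textbook bitonic sorting network, sketched here on a 1D list. It is a sequential illustration under the assumption that the input length is a power of two, not the GPU code; on the GPU every element of a step would be computed in parallel.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # stage: merge bitonic sequences of length k
        d = k // 2
        while d >= 1:      # step: compare elements at fixed distance d
            out = a[:]     # each output element is written independently
            for i in range(n):
                partner = i ^ d
                ascending = (i & k) == 0
                # Lower element of a pair takes MIN in ascending blocks,
                # MAX in descending blocks; the upper element the opposite.
                keep_min = ((i & d) == 0) == ascending
                if keep_min:
                    out[i] = min(a[i], a[partner])
                else:
                    out[i] = max(a[i], a[partner])
            a = out
            d //= 2
        k *= 2
    return a
```

Each output location reads two inputs and writes only itself, so no step needs arbitrary (scattered) writes.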

26


Flash animation removed to save space (46 MB!)

27


2D Memory Addressing

GPUs optimized for 2D representations

Map 1D arrays to 2D arrays
Minimum and maximum regions mapped to row-aligned or column-aligned quads
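A small sketch of why the MIN and MAX regions become aligned quads. It assumes a row-major layout with a power-of-two width and ignores the sort direction, so the lower element of every compared pair takes the MIN; `to_2d` and `min_region` are hypothetical names.

```python
def to_2d(i, width):
    """Row-major mapping of a 1D index to (x, y) array coordinates."""
    return i % width, i // width

def min_region(width, height, d):
    """Cells that take the MIN when comparing at 1D distance d."""
    return {to_2d(i, width) for i in range(width * height) if (i & d) == 0}

# 4 x 4 array, distance 2 (smaller than the width): column-aligned quad.
print(sorted(min_region(4, 4, 2)))
# 4 x 4 array, distance 8 (a multiple of the width): row-aligned quad.
print(sorted(min_region(4, 4, 8)))
```

With distance 2 the MIN cells are exactly columns 0 and 1; with distance 8 they are exactly rows 0 and 1, so each region can be drawn as one quad.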

28


1D – 2D Mapping

MIN MAX

29


1D – 2D Mapping

MIN

Effectively reduce instructions per element

30


Sorting on GPU: Pipelining and Parallelism

Input Vertices

Texturing, Caching and 2D Quad

Comparisons

Sequential Writes

31


Comparison with GPU-Based Algorithms

3-6x faster than prior GPU-based algorithms!

32


GPU vs. High-End Multi-Core CPUs

2-2.5x faster than Intel high-end processors

Single GPU performance comparable to high-end dual-core Athlon

Hand-optimized CPU code from Intel Corporation!

33


Super-Moore’s Law Growth

50 GB/s on a single GPU

Peak Performance: Effectively hide memory latency with 15 GOP/s

Download URL: http://gamma.cs.unc.edu/GPUSORT

Slashdot and Tom’s Hardware news headlines

34


Implementation & Results

Pentium IV PC ($170)
NVIDIA 7800 GT ($270)
2 GB RAM ($152)
9 80GB SATA disks ($477)
SuperMicro Motherboard & SATA Controller ($325)
Windows XP

PC costs $1469

35


Implementation & Results

Indy SortBenchmark
10-byte random string keys
100-byte records
Sort maximum amount in 644 seconds
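One way to read the 644-second limit: it matches the amount of machine time one cent buys on the $1469 PC if the hardware price is spread over three years. The three-year lifetime and the one-cent budget are assumptions here (they follow the PennySort-style rule; the slides do not state them), and `time_budget_seconds` is a hypothetical helper.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def time_budget_seconds(system_price_dollars, budget_cents=1, lifetime_years=3):
    """Seconds of machine time that budget_cents buys on the given system."""
    price_cents = system_price_dollars * 100
    return budget_cents * lifetime_years * SECONDS_PER_YEAR / price_cents

print(int(time_budget_seconds(1469)))  # prints: 644
```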

36


Overall Performance

Faster and more scalable than Dual Xeon processors (3.6 GHz)!

37


Performance/$

1.8x faster than current Terabyte sorter

World’s best price-to-performance system

http://research.microsoft.com/barc/SortBenchmark

38


Analysis: I/O Performance

Salzberg Analysis: 100 MB Run Size

Peak sequential throughput in MB/s

39



Pentium IV: 25MB Run Size (to reduce memory latency)

Less work and only 75% IO efficient!

Salzberg Analysis: 100 MB Run Size

40



Dual 3.6 GHz Xeons: 25MB Run size (to reduce memory latency)

More cores, less work but only 85% IO efficient!


41



7800 GT: 100MB run size

Ideal work, and 92% IO efficient with single CPU!


42


Task Parallelism

Performance limited by IO and memory

Sorting 100MB on GPU

Reorder or Sequential IO

Sorting 100MB on GPU: 3x > reorder or sequential IO

43


Why GPU-like Architectures for Large Data Management?

Plateau: Data Management Performance Crisis

GPU

44


Advantages

Exploit high memory bandwidth on GPUs

Higher memory performance than CPU-based algorithms

High I/O performance due to large run sizes

45


Advantages

Offload work from CPUs
CPU cycles well-utilized for resource management

Scalable solution for large databases

Best performance/price solution for terabyte sorting

46


Limitations

May not work well on variable-sized keys and almost sorted databases

Requires programmable GPUs (GPUs manufactured after 2003)

47


Conclusions

Designed new sorting algorithms on GPUs

Handles wide keys and long records

Achieves 10x higher memory performance

Memory-efficient sorting algorithm with peak memory performance of 50 GB/s on GPUs
15 GOP/sec on a single GPU

48


Conclusions

Novel external memory sorting algorithm as a scalable solution

Achieves peak I/O performance on CPUs
Best performance/price solution: world’s fastest sorting system

High performance growth rate characteristics

Improve 2-3 times/yr

49


Future Work

Designed high performance/price solutions

High wattage and cooling requirements of CPUs and GPUs

To exploit GPUs, we need easy-to-use programming APIs

Promising directions: BrookGPU, Microsoft Accelerator, Sh, etc.

Scientific libraries utilizing high parallelism and memory bandwidth

Scientific routines on LU, QR, SVD, FFT, etc.
BLAS library on GPUs
Eventually, build GPU-LAPACK and Matlab routines

50


GPUFFTW

N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006 (to appear)

Download URL: http://gamma.cs.unc.edu/GPUFFTW

4x faster than IMKL on high-end Quad cores

SlashDot Headlines, May 2006

51


GPU Roadmap

GPUs are becoming more general purpose

Fewer limitations in Microsoft DirectX10 API
• Better and consistent floating point support
• Integer instruction support
• More programmable stages, etc.

Significant advance in performance

GPUs are being widely adopted in commercial applications

E.g., Microsoft Vista

52


Call to Action

Don’t put all your eggs in the multi-core basket
If you want TeraOps, go where they are
If you want memory bandwidth, go where the memory bandwidth is.
CPU-GPU gap is widening
Microsoft Xbox is ½ TeraOP today.

40 GOPS

40 GB/s

53


Acknowledgements

Research Sponsors:
Army Research Office
Defense Advanced Research Projects Agency
National Science Foundation
Naval Research Laboratory
Intel Corporation
Microsoft Corporation

Craig Peeper, Peter-Pike Sloan, David Blythe, Jingren Zhou

NVIDIA Corporation
RDECOM

54


Acknowledgements

David Tuft (UNC)

UNC Systems, GAMMA and Walkthrough groups

55


Thank You

Questions or Comments?

{naga,ritesh,dm}@cs.unc.edu

[email protected]

http://www.cs.unc.edu/~naga
http://research.microsoft.com/~Gray
