
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL & MICROSOFT RESEARCH

GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Data Management

Naga K. Govindaraju Jim Gray

Ritesh Kumar Dinesh Manocha

http://gamma.cs.unc.edu/GPUTERASORT


Sorting

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth


Sorting

Well studied
High performance computing
Databases
Computer graphics
Programming languages
...

Google MapReduce algorithm
SPEC benchmark routine!


Massive Databases

Terabyte-scale data sets are common
Google sorts more than 100 billion terms in its index
More than 1 trillion web records indexed!

Database sizes are rapidly increasing!

Maximum DB sizes increase 3x per year (http://www.wintercorp.com)
Processor improvements are not keeping pace with the information explosion


CPU vs. GPU

CPU (3 GHz), 2 x 1 MB cache
System memory (2 GB)
AGP memory (512 MB)
PCI-E bus (4 GB/s)
Video memory (512 MB)
GPU (690 MHz)


External Memory Sorting

Performed on Terabyte-scale databases

Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]

Limited main memory
First phase – partitions the input file into large data chunks and writes sorted chunks known as “Runs”
Second phase – merges the “Runs” to generate the sorted file
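The two phases can be sketched in a few lines of Python. This is a minimal illustration, not the GPUTeraSort implementation: the function name `external_sort`, the line-oriented records, and the `run_size` parameter (counted in lines rather than bytes) are all assumptions made for the example.

```python
import heapq
import os
import tempfile


def external_sort(lines, run_size):
    """Yield newline-terminated strings in sorted order using bounded memory.

    Phase 1 writes sorted runs of at most `run_size` lines to temp files.
    Phase 2 streams a k-way merge over all runs.
    """
    run_paths = []
    buffer = []

    def flush():
        # Sort the in-memory chunk and write it out as one run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(buffer)
        run_paths.append(path)
        buffer.clear()

    # Phase 1: partition the input into sorted runs.
    for line in lines:
        buffer.append(line)
        if len(buffer) >= run_size:
            flush()
    if buffer:
        flush()

    # Phase 2: merge the runs; heapq.merge reads each run lazily.
    files = [open(p) for p in run_paths]
    try:
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

Larger runs mean fewer files to merge in phase 2, which is why the run size matters for I/O performance.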


External Memory Sorting

Performance mainly governed by I/O

Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)


Salzberg Analysis

If N=100GB, T=2MB, then R ≈ 230MB

Large data sorting on CPUs can achieve high I/O performance by sorting large runs
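As a rough aid, the relation R ≈ √(TN) can be evaluated directly. The helper names below are hypothetical, and `merge_buffer_bytes` assumes that phase 2 holds one read buffer of size T per run, which the slides do not state. Under that assumption, choosing R = √(TN) makes the phase 2 buffer space equal to the phase 1 run size.

```python
import math


def salzberg_run_size(file_size_bytes, read_size_bytes):
    """Run size R = sqrt(T * N), taken literally from the slide."""
    return math.sqrt(read_size_bytes * file_size_bytes)


def merge_buffer_bytes(file_size_bytes, run_size_bytes, read_size_bytes):
    """Phase 2 memory, assuming one read buffer of size T per run."""
    num_runs = math.ceil(file_size_bytes / run_size_bytes)
    return num_runs * read_size_bytes


# Toy numbers: N = 100 units, T = 1 unit gives R = 10 units and 10 runs.
r = salzberg_run_size(100, 1)
print(r, merge_buffer_bytes(100, r, 1))
```

Evaluated with decimal units, the bare square root gives roughly 447 MB for N = 100 GB and T = 2 MB, so the 230 MB quoted above presumably includes a constant factor that the transcript does not show.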


Massive Data Handling on CPUs

Require random memory accesses
Small CPU caches (< 2MB)
Slower than even sequential disk accesses – bottleneck shift from I/O to memory
Widening memory to compute gap!

External memory sorting on CPUs can have low performance due to

High memory latency on account of cache misses
Or low I/O performance

Sorting is hard!


Graphics Processing Units (GPUs)

Commodity processor for graphics applications
Massively parallel vector processors
High memory bandwidth
Low memory latency pipeline
Programmable

High growth rate


GPU: Commodity Processor

Cell phones
Laptops
Consoles
PSP
Desktops


Graphics Processing Units (GPUs)

Commodity processor for graphics applications
Massively parallel vector processors
10x more operations per sec than CPUs
High memory bandwidth
Low memory latency pipeline
Programmable

High growth rate


Parallelism on GPUs

Graphics FLOPS

GPU – 1.3 TFLOPS

CPU – 25.6 GFLOPS


Graphics Processing Units (GPUs)

Commodity processor for graphics applications
Massively parallel vector processors
High memory bandwidth
10x more memory bandwidth than CPUs
Better hides memory latency
Programmable

High growth rate


Graphics Pipeline

[Figure: GPU graphics pipeline backed by memory at 56 GB/s. Stages: vertex (programmable vertex processing, fp32); setup/rasterizer (polygon setup, culling, rasterization); pixel (programmable per-pixel math, fp32); texture (per-pixel texture, fp16 blending); image (Z-buf, fp16 blending, anti-alias, MRT). Annotations: low pipeline depth; hides memory latency!!]


NON-Graphics Pipeline Abstraction

[Figure: the same pipeline viewed as a general data-processing pipeline backed by memory. Stages: data (programmable MIMD processing, fp32); setup/rasterizer (lists, SIMD “rasterization”); data (programmable SIMD processing, fp32); data (data fetch, fp16 blending); data (predicated write, fp16 blend, multiple output).]

Courtesy: David Kirk, Chief Scientist, NVIDIA


Technology Trends: CPU and GPU

[Figure: log of relative processing power versus year (2002, 2004, 2006, 2008). CPU curve: Moore’s Law trajectory, bounded by cooling (cost) limitations, plotted against corporate desktop software requirements. GPU curve: “Moore’s Law 3 for 18 mo, then Moore’s Law trajectory”, plotted against graphics requirements (enhanced experience). Market segments shown: value, mobile, mainstream desktop, desktop ‘replacement’, leading edge, enthusiast / specialty, value / UMA.]


Architecture of Phase 1: GPUTeraSort


GPUs for Sorting: Issues

No support for arbitrary writes
Optimized CPU algorithms do not map!
Requires new algorithms – sorting networks

Lack of support for general data types
Out-of-core algorithms

Limited GPU memory

Difficult to program


General Sorting on GPUs

Sorting networks: No data dependencies

Utilize high parallelism on GPUs

To handle large keys, use bitonic radix sort

Perform bitonic sort on the 4 most significant bytes (MSBs) using GPUs, find the groups of sorted records with equal 4 MSBs, proceed to the next 4 bytes within those groups, and so on
Can handle keys of any length
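The chunk-by-chunk refinement can be illustrated on the CPU. In this sketch Python's built-in `sorted` stands in for the GPU bitonic sort on each 4-byte chunk; the function name `sort_wide_keys` and its parameters are hypothetical.

```python
def sort_wide_keys(records, key_len, chunk=4):
    """Sort byte-string records by their first `key_len` bytes, one chunk at a time."""

    def refine(group, offset):
        if len(group) <= 1 or offset >= key_len:
            return group
        end = min(offset + chunk, key_len)
        # Stand-in for the GPU bitonic sort on one fixed-width key chunk.
        group = sorted(group, key=lambda r: r[offset:end])
        out, i = [], 0
        while i < len(group):
            j = i
            # Find the group of records that tie on this chunk.
            while j < len(group) and group[j][offset:end] == group[i][offset:end]:
                j += 1
            # Only tied records need to look at the next chunk.
            out.extend(refine(group[i:j], end))
            i = j
        return out

    return refine(list(records), 0)
```

With random keys, ties on the first 4 bytes are rare, so most of the work happens in the first pass.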


GPU-Based Sorting Networks

Represent data as 2D arrays

Multi-stage algorithm
Each stage involves multiple steps

In each step
1. Compare one array element against exactly one other element at a fixed distance
2. Perform a conditional assignment (MIN or MAX) at each element location
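The stage and step structure above matches a bitonic sorting network, sketched here on a 1D list. This is a sequential CPU illustration under the assumption that the input length is a power of two; on a GPU, the inner loop over `i` would run in parallel, with each element reading its partner and writing only its own location.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic sorting network."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:        # stage: merge bitonic sequences of length k
        j = k // 2
        while j >= 1:    # step: compare elements at fixed distance j
            nxt = list(a)
            for i in range(n):
                partner = i ^ j                  # exactly one partner per element
                ascending = (i & k) == 0         # sort direction of this block
                keep_min = (i < partner) == ascending
                # Conditional assignment: each location keeps MIN or MAX.
                nxt[i] = min(a[i], a[partner]) if keep_min else max(a[i], a[partner])
            a = nxt
            j //= 2
        k *= 2
    return a
```

No element ever writes to another element's location, which is what lets the network run without arbitrary (scatter) writes.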


2D Memory Addressing

GPUs optimized for 2D representations

Map 1D arrays to 2D arrays
Minimum and maximum regions mapped to row-aligned or column-aligned quads
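One way to see why the regions end up row-aligned or column-aligned is to lay the 1D array out row by row in a 2D texture. The sketch below assumes a row-major layout and power-of-two width and compare distance; the function names are hypothetical.

```python
def to_2d(index, width):
    """Row-major (row, col) position of a 1D index in a texture `width` texels wide."""
    return divmod(index, width)


def partner_offset(distance, width):
    """(row, col) offset between an element and its compare partner.

    Assumes `width` and `distance` are powers of two, as in a bitonic network.
    """
    if distance >= width:
        # Partner lies in another row: regions are aligned to whole rows.
        return (distance // width, 0)
    # Partner lies in the same row: regions are aligned to column blocks.
    return (0, distance)


# Element 10 in a 4-wide texture, compared at distance 8 with element 2.
print(to_2d(10, 4), to_2d(10 ^ 8, 4), partner_offset(8, 4))
```

Because the offset is the same for every element in a step, a whole MIN or MAX region can be drawn as a single quad.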


1D – 2D Mapping

MIN MAX


Effectively reduce instructions per element


Sorting on GPU: Pipelining and Parallelism

Input Vertices

Texturing, Caching and 2D Quad Comparisons

Sequential Writes


Comparison with GPU-Based Algorithms

3-6x faster than prior GPU-based algorithms!


GPU vs. High-End Multi-Core CPUs

2-2.5x faster than Intel high-end processors

Single GPU performance comparable to a high-end dual-core Athlon

Hand-optimized CPU code from Intel Corporation!


Super-Moore’s Law Growth

50 GB/s on a single GPU

Peak Performance: Effectively hide memory latency with 15 GOP/s

Download URL: http://gamma.cs.unc.edu/GPUSORT

Slashdot and Tom's Hardware news headlines


Implementation & Results

Pentium IV PC ($170)
NVIDIA 7800 GT ($270)
2 GB RAM ($152)
9 x 80 GB SATA disks ($477)
SuperMicro Motherboard & SATA Controller ($325)
Windows XP

PC costs $1469


Implementation & Results

Indy SortBenchmark
10-byte random string keys
100-byte records
Sort maximum amount in 644 seconds
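The 644-second limit is consistent with a PennySort-style time budget. The sketch below assumes, since the slide does not say so, that the system is depreciated over three years and that the sort may use one cent of machine time; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def penny_time_budget(system_cost_dollars, depreciation_years=3):
    """Seconds of machine time one cent buys under straight-line depreciation."""
    cost_in_cents = system_cost_dollars * 100
    return depreciation_years * SECONDS_PER_YEAR / cost_in_cents


# For the $1469 PC described in these slides this comes to about 644 seconds.
print(round(penny_time_budget(1469)))
```

A cheaper system therefore gets more seconds to sort, which is why the benchmark rewards price as well as raw speed.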


Overall Performance

Faster and more scalable than Dual Xeon processors (3.6 GHz)!


Performance/$

1.8x faster than current Terabyte sorter

World’s best price-to-performance system

http://research.microsoft.com/barc/SortBenchmark


Analysis: I/O Performance

Salzberg Analysis: 100 MB Run Size

Peak sequential throughput in MB/s


Analysis: I/O Performance

Pentium IV: 25MB Run Size (to reduce memory latency)

Less work and only 75% IO efficient!

Salzberg Analysis: 100 MB Run Size


Analysis: I/O Performance

Dual 3.6 GHz Xeons: 25MB Run size (to reduce memory latency)

More cores, less work but only 85% IO efficient!

Salzberg Analysis: 100 MB Run Size


Analysis: I/O Performance

7800 GT: 100MB run size

Ideal work, and 92% IO efficient with single CPU!

Salzberg Analysis: 100 MB Run Size


Task Parallelism

Performance limited by IO and memory

Sorting 100MB on GPU

Reorder or Sequential IO

Sorting 100MB on GPU: 3x > reorder or sequential IO


Why GPU-like Architectures for Large Data Management?

Plateau: Data Management Performance Crisis

GPU


Advantages

Exploit high memory bandwidth on GPUs

Higher memory performance than CPU-based algorithms

High I/O performance due to large run sizes


Advantages

Offload work from CPUs
CPU cycles well-utilized for resource management

Scalable solution for large databases

Best performance/price solution for terabyte sorting


Limitations

May not work well on variable-sized keys and almost sorted databases

Requires programmable GPUs (GPUs manufactured after 2003)


Conclusions

Designed new sorting algorithms on GPUs

Handles wide keys and long records

Achieves 10x higher memory performance

Memory-efficient sorting algorithm with peak memory performance of 50 GB/s on GPUs
15 GOP/s on a single GPU


Conclusions

Novel external memory sorting algorithm as a scalable solution

Achieves peak I/O performance on CPUs
Best performance/price solution – world’s fastest sorting system

High performance growth rate characteristics

Improve 2-3 times/yr


Future Work

Designed high performance/price solutions

High wattage and cooling requirements of CPUs and GPUs

To exploit GPUs, we need easy-to-use programming APIs

Promising directions: BrookGPU, Microsoft Accelerator, Sh, etc.

Scientific libraries utilizing high parallelism and memory bandwidth

Scientific routines on LU, QR, SVD, FFT, etc.
BLAS library on GPUs
Eventually, build GPU-LAPACK and Matlab routines


GPUFFTW

N. Govindaraju, S. Larsen, J. Gray and D. Manocha, Proc. of ACM SuperComputing, 2006 (to appear)

Download URL: http://gamma.cs.unc.edu/GPUFFTW

4x faster than Intel MKL (IMKL) on high-end quad cores

SlashDot Headlines, May 2006


GPU Roadmap

GPUs are becoming more general purpose

Fewer limitations in Microsoft DirectX 10 API
• Better and consistent floating point support
• Integer instruction support
• More programmable stages, etc.

Significant advance in performance

GPUs are being widely adopted in commercial applications

E.g., Microsoft Vista


Call to Action

Don’t put all your eggs in the Multi-core basket
If you want TeraOps – go where they are
If you want memory bandwidth – go where the memory bandwidth is.
CPU-GPU gap is widening
Microsoft Xbox is ½ TeraOP today.

40 gops

40 gBps


Acknowledgements

Research Sponsors:
Army Research Office
Defense Advanced Research Projects Agency
National Science Foundation
Naval Research Laboratory
Intel Corporation
Microsoft Corporation

Craig Peeper, Peter-Pike Sloan, David Blythe, Jingren Zhou

NVIDIA Corporation
RDECOM


Acknowledgements

David Tuft (UNC)

UNC Systems, GAMMA and Walkthrough groups


Thank You

Questions or Comments?

{naga,ritesh,dm}@cs.unc.edu

[email protected]

http://www.cs.unc.edu/~naga
http://research.microsoft.com/~Gray