stream processing

Post on 01-Jul-2015

DESCRIPTION

Presented in CMPUT 429, Fall '11

TRANSCRIPT

Stream Processing

CMPUT 530

Arnamoy Bhattacharyya

Instructor: Dr. J Nelson Amaral

Stream processing is a computer programming paradigm, related to SIMD

It allows some applications to more easily exploit a limited form of parallel processing

A stream is simply a set of records that require similar computation. Streams provide data parallelism

Kernels are the functions that are applied to each element in the stream

For each element we can only read from the input, perform operations on it, and write to the output
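The stream and kernel model above can be sketched in a few lines of Python. This is only an illustration: the `Pixel` record, the `to_gray` kernel, and the `run_kernel` helper are hypothetical names chosen for the example, not part of any stream processing system.

```python
from typing import Callable, Iterable, Iterator, NamedTuple


class Pixel(NamedTuple):
    """One record in the stream (a hypothetical RGB pixel)."""
    r: int
    g: int
    b: int


def to_gray(p: Pixel) -> int:
    """Kernel: reads one input record and produces one output value."""
    return (p.r + p.g + p.b) // 3


def run_kernel(kernel: Callable[[Pixel], int],
               stream: Iterable[Pixel]) -> Iterator[int]:
    """Apply the kernel to every record; no record depends on another."""
    for record in stream:
        yield kernel(record)


pixels = [Pixel(255, 255, 255), Pixel(30, 60, 90), Pixel(0, 0, 0)]
gray = list(run_kernel(to_gray, pixels))
print(gray)  # [255, 60, 0]
```

Because the kernel only reads its own input record and writes its own output, the loop body could run on all records at once, which is where the data parallelism comes from.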

Stream processing is especially suitable for applications that exhibit three characteristics:

1. Compute intensity

2. Data parallelism

3. Data locality

Flynn’s Taxonomy: SISD

Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle

Single Data: Only one data stream is being used as input during any one clock cycle

Flynn’s Taxonomy: SIMD

Single Instruction: All processing units execute the same instruction at any given clock cycle

Multiple Data: Each processing unit can operate on a different data element

Flynn’s Taxonomy: MISD

Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams

Single Data: A single data stream is fed into multiple processing units

Flynn’s Taxonomy: MIMD

Multiple Instruction: Every processor may be executing a different instruction stream

Multiple Data: Every processor may be working with a different data stream
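The difference between SISD and SIMD can be made concrete with a small simulation. The sketch below is an assumption-laden illustration: Python itself runs sequentially, so `simd_add` only models the idea that one step applies the same add to several "lanes"; the function names and the lane count of 4 are made up for the example.

```python
def sisd_add(a, b):
    """SISD style: one instruction stream, one data element per step."""
    out = []
    for i in range(len(a)):
        out.append(a[i] + b[i])  # one add handles one element
    return out


def simd_add(a, b, lanes=4):
    """SIMD style (simulated): each step applies the same add to `lanes` elements."""
    out = []
    steps = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # Conceptually a single instruction operating on all lanes together.
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        steps += 1
    return out, steps


a = list(range(8))
b = [10] * 8
print(sisd_add(a, b))     # [10, 11, 12, 13, 14, 15, 16, 17]
print(simd_add(a, b))     # ([10, 11, 12, 13, 14, 15, 16, 17], 2)
```

Both versions compute the same result; the simulated SIMD version needs 2 steps instead of 8 because each step covers four data elements.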

Stream processing makes use of locality of reference by explicitly grouping related code and data together for easy fetching into the cache

Stream Processors

StreamIt

A stream processing language for programs based on streams of data

e.g., audio, video, DSP, networking, and cryptographic processing kernels

HDTV editing, radar tracking, microphone arrays, cellphone base stations, graphics

[Thies 2002]

A high-level, architecture-independent language for streaming applications

1. Improves programmer productivity (vs. Java, C)

2. Offers scalable performance on multicores

[Thies 2002]
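To give a feel for programs "based on streams of data", here is a minimal sketch of a pipeline of filters written with Python generators. It is not StreamIt syntax; the filter names (`source`, `moving_average`, `scale`) and their parameters are hypothetical and only illustrate how small stages can be chained so that each one consumes and produces a stream.

```python
from typing import Iterable, Iterator


def source(n: int) -> Iterator[float]:
    """Produce a stream of n samples: 0.0, 1.0, ..."""
    for i in range(n):
        yield float(i)


def moving_average(samples: Iterable[float], window: int) -> Iterator[float]:
    """Filter: emit the mean of each full sliding window of samples."""
    buf = []
    for s in samples:
        buf.append(s)
        if len(buf) > window:
            buf.pop(0)  # drop the oldest sample
        if len(buf) == window:
            yield sum(buf) / window


def scale(samples: Iterable[float], gain: float) -> Iterator[float]:
    """Filter: multiply every sample by a constant gain."""
    for s in samples:
        yield s * gain


# Pipeline: source -> moving_average -> scale
result = list(scale(moving_average(source(5), window=3), gain=2.0))
print(result)  # [2.0, 4.0, 6.0]
```

Each filter only sees its input stream and its output stream, so the stages could in principle be mapped to different cores without changing the program text.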

GPU

A GPU is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn

Used primarily for 3D applications

A GPU can be present on a video card, on the motherboard, or, in certain CPUs, on the CPU die

World’s First GPU

Nvidia in 1999 marketed the GeForce 256 as "the world's first 'GPU', a single-chip processor that is capable of processing a minimum of 10 million polygons per second".

Rival ATI Technologies coined the term visual processing unit, or VPU, with the release of the Radeon 9700 in 2002.

GPUs have a very high compute capacity

To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers

To the software, the accelerator is another computer to which your program sends data and routines to execute

GPGPU

This concept turns the massive floating-point computational power of a modern graphics accelerator into general-purpose computing power

GPUs can only process independent vertices and fragments, but can process many of them in parallel

GPGPU applications need to have high arithmetic intensity

GPUs are stream processors – processors that can operate in parallel by running a single kernel on many records in a stream at once

Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements
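The slides do not define arithmetic intensity; the sketch below assumes the common definition of arithmetic operations per word of memory traffic. The two workloads (SAXPY and a naive square matrix multiply) and the operation counts in the comments are illustrative assumptions, not figures from the presentation.

```python
def arithmetic_intensity(ops: float, words_moved: float) -> float:
    """Arithmetic operations performed per word read from or written to memory."""
    return ops / words_moved


def saxpy_intensity(n: int) -> float:
    # y[i] = a * x[i] + y[i]: one multiply and one add per element,
    # with x[i] and y[i] read and y[i] written (3 words per element).
    return arithmetic_intensity(2 * n, 3 * n)


def matmul_intensity(n: int) -> float:
    # C = A * B for n x n matrices: about 2 * n**3 operations,
    # and at minimum A and B are read and C is written (3 * n**2 words).
    return arithmetic_intensity(2 * n**3, 3 * n**2)


print(round(saxpy_intensity(1000), 3))  # 0.667
print(matmul_intensity(300))            # 200.0
```

Under these assumptions SAXPY does less than one operation per word moved, so memory traffic dominates, while the matrix multiply's intensity grows with n, which is the kind of workload the GPGPU slides describe as a good fit.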

In certain circumstances the GPU calculates forty times faster than conventional CPUs

AMD Athlon 64 X2 CPU: 154 m

ATI X1950 XTX GPU: 384 m

Intel Core 2 Quad CPU: 582 m

NVIDIA G8800 GTX GPU: 680 m

“The processing power of just 5,000 ATI processors is also enough to rival that of the existing 200,000 computers currently involved in the Folding@home project” [Ref 1]

“... it is estimated that if a mere 10,000 computers were to each use an ATI processor to conduct folding research, that the Folding@home program would effectively perform faster than the fastest supercomputer in existence today, surpassing the 1 petaFLOP level” (2007)

November 10, 2011: Folding@home at 6.0 petaFLOPS, compared with 8.162 petaFLOPS for the K computer [Ref 1]

Comparing GPUs to CPUs isn't an apples-to-apples comparison:

GPU clock rates are lower

The architectures are radically different

The problems they're trying to solve are almost completely unrelated

Application Processor:

Executes application code such as MPEG decoding

Sequences the instructions and issues them to the stream clients, e.g., the KEU and the DRAM interface

[Kapasi 2003]

Two Stream Clients:

KEU: Programmable Kernel Execution Unit

DRAM interface:

Provides access to global data storage

[Kapasi 2003]

KEU:

It has two stream level instructions:

1. load_kernel – loads a compiled kernel function into the local instruction storage inside the KEU

2. run_kernel – executes the kernel

[Kapasi 2003]

DRAM interface:

Two stream-level instructions as well:

1. load_stream – loads an entire stream from DRAM into the SRF

2. store_stream – stores a stream from the SRF to DRAM

[Kapasi 2003]
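The four stream-level instructions can be tied together with a toy model. This is a hedged sketch, not the architecture from [Kapasi 2003]: the `ToyStreamProcessor` class, its dictionaries, and the stream names are hypothetical, and it assumes that load_stream moves a stream from DRAM into the SRF and store_stream moves it back.

```python
class ToyStreamProcessor:
    """Toy model: the application processor issues stream instructions to two clients."""

    def __init__(self, dram):
        self.dram = dict(dram)  # global data storage behind the DRAM interface
        self.srf = {}           # stream register file, holds whole streams
        self.kernel = None      # kernel storage inside the KEU

    # DRAM interface instructions (assumed direction: DRAM <-> SRF)
    def load_stream(self, name):
        self.srf[name] = list(self.dram[name])

    def store_stream(self, name):
        self.dram[name] = list(self.srf[name])

    # KEU instructions
    def load_kernel(self, kernel):
        self.kernel = kernel

    def run_kernel(self, src, dst):
        if self.kernel is None:
            raise RuntimeError("load_kernel must be issued before run_kernel")
        # The kernel is applied to every record of the source stream.
        self.srf[dst] = [self.kernel(x) for x in self.srf[src]]


sp = ToyStreamProcessor({"samples": [1, 2, 3, 4]})
sp.load_stream("samples")            # DRAM -> SRF
sp.load_kernel(lambda x: x * x)      # kernel into the KEU
sp.run_kernel("samples", "squares")  # SRF -> KEU -> SRF
sp.store_stream("squares")           # SRF -> DRAM
print(sp.dram["squares"])  # [1, 4, 9, 16]
```

The point of the sketch is the ordering: whole streams are staged in the SRF, the kernel runs over them there, and only the final result goes back to global storage.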

Local register files (LRFs)

1. used for operands of arithmetic operations (similar to caches on CPUs)

2. exploit fine-grain locality

[Kapasi 2003]

Stream register files (SRFs)

1. capture coarse-grain locality

2. efficiently transfer data to and from the LRFs

[Kapasi 2003]

[Kapasi 2003]

Topics learnt today:

1. Stream processing

2. StreamIt language from MIT

3. How modern GPUs use stream processing

4. Imagine stream processor from Stanford

References:

1. http://folding.stanford.edu/

2. Kapasi, U. J., Rixner, S., Dally, W. J., Khailany, B., Ahn, J. H., Mattson, P., and Owens, J. D., “Programmable Stream Processors,” IEEE Computer, August 2003.

3. Thies, W., Karczmarek, M., and Amarasinghe, S., “StreamIt: A Language for Streaming Applications,” in Proc. Intl. Conf. on Compiler Construction, 2002.
