Stream Processing, CMPUT 530, Arnamoy Bhattacharyya. Instructor: Dr. J Nelson Amaral

Upload: arnamoy10

Posted on 01-Jul-2015

DESCRIPTION

Presented in CMPUT 429, Fall '11

TRANSCRIPT

Page 1: Stream Processing

Stream Processing

CMPUT 530

Arnamoy Bhattacharyya

Instructor: Dr. J Nelson Amaral

Page 2: Stream Processing

Stream processing is a computer programming paradigm, related to SIMD

Page 3: Stream Processing

It allows some applications to more easily exploit a limited form of parallel processing

Page 4: Stream Processing

A stream is simply a set of records that require similar computation. Streams provide data parallelism

Page 5: Stream Processing

Kernels are the functions that are applied to each element in the stream

Page 6: Stream Processing

For each element we can only read from the input, perform operations on it, and write to the output
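The stream and kernel model above can be sketched in a few lines of Python. This is only an illustration: the `Pixel` record, the `to_gray` kernel, and the `run_kernel` helper are hypothetical names chosen for the example, not part of the original slides. The kernel reads one input record, computes on it, and produces one output value, touching nothing else.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Pixel:
    """One record in the input stream (a hypothetical RGB pixel)."""
    r: int
    g: int
    b: int


def to_gray(p: Pixel) -> int:
    """Kernel: reads one input record and returns one output value.

    It uses no other record and no global state, so each call is
    independent of every other call.
    """
    return (299 * p.r + 587 * p.g + 114 * p.b) // 1000


def run_kernel(kernel: Callable, stream: Iterable) -> Iterator:
    """Apply the same kernel to every element of the stream, in order."""
    for record in stream:
        yield kernel(record)


pixels = [Pixel(255, 255, 255), Pixel(0, 0, 0), Pixel(255, 0, 0)]
gray = list(run_kernel(to_gray, pixels))
print(gray)
```

Because the kernel only sees its own record, the calls could run in any order or all at once, which is where the data parallelism comes from.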

Page 7: Stream Processing

Stream processing is especially suitable for applications that exhibit three characteristics ---

Page 8: Stream Processing

Compute Intensity

Page 9: Stream Processing

Data Parallelism

Page 10: Stream Processing

Data Locality

Page 11: Stream Processing

Flynn’s Taxonomy: SISD

Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle

Single Data: Only one data stream is being used as input during any one clock cycle

Page 12: Stream Processing

Flynn’s Taxonomy: SIMD

Single Instruction: All processing units execute the same instruction at any given clock cycle

Multiple Data: Each processing unit can operate on a different data element
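The difference between the SISD and SIMD slides can be illustrated with a small simulation. This is a sketch in plain Python, not real vector hardware: the function names and the `lanes` parameter are assumptions made for the example. The SIMD version counts one step per group of `lanes` elements, because every lane performs the same add on its own data element.

```python
from typing import List, Tuple


def sisd_add(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """SISD view: each step applies one add to one pair of elements."""
    out = []
    steps = 0
    for x, y in zip(a, b):
        out.append(x + y)
        steps += 1
    return out, steps


def simd_add(a: List[float], b: List[float], lanes: int = 4) -> Tuple[List[float], int]:
    """SIMD view: each step applies the same add to `lanes` pairs at once."""
    out = []
    steps = 0
    for start in range(0, len(a), lanes):
        # One simulated instruction: every lane adds its own pair of elements.
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        steps += 1
    return out, steps


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
scalar_result, scalar_steps = sisd_add(a, b)
vector_result, vector_steps = simd_add(a, b, lanes=4)
print(scalar_steps, vector_steps)
```

Both versions produce the same values; only the number of simulated instruction steps differs.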

Page 13: Stream Processing

Flynn’s Taxonomy: MISD

Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.

Single Data: A single data stream is fed into multiple processing units.

Page 14: Stream Processing

Flynn’s Taxonomy: MIMD

Multiple Instruction: Every processor may be executing a different instruction stream

Multiple Data: Every processor may be working with a different data stream

Page 15: Stream Processing

Stream Processors

Stream processing makes use of locality of reference by explicitly grouping related code and data together for easy fetching into the cache

Page 16: Stream Processing

StreamIt: a stream processing language for programs based on streams of data

e.g., audio, video, DSP, networking, and cryptographic processing kernels

HDTV editing, radar tracking, microphone arrays, cellphone base stations, graphics

[Thies 2002]

Page 17: Stream Processing

A high-level, architecture-independent language for streaming applications

1. Improves programmer productivity (vs. Java, C)

2. Offers scalable performance on multicores

[Thies 2002]
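The idea of building a program out of stages connected by streams of data can be sketched with Python generators. This is not StreamIt syntax, and none of the names below come from the slides or from the StreamIt language: `int_source`, `moving_average`, `scale`, and `pipeline` are hypothetical helpers that only illustrate the style of composition.

```python
from collections import deque
from typing import Callable, Iterable, Iterator


def int_source(n: int) -> Iterator[int]:
    """Source stage: emits the integers 0..n-1 onto its output stream."""
    yield from range(n)


def moving_average(stream: Iterable[float], window: int) -> Iterator[float]:
    """Stage: emits the mean of the `window` most recent input items."""
    recent = deque(maxlen=window)
    for item in stream:
        recent.append(item)
        if len(recent) == window:
            yield sum(recent) / window


def scale(stream: Iterable[float], factor: float) -> Iterator[float]:
    """Stage: multiplies every item by a constant factor."""
    for item in stream:
        yield item * factor


def pipeline(source: Iterable, *stages: Callable) -> Iterable:
    """Connect stages in sequence; each stage maps a stream to a stream."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return stream


result = list(pipeline(
    int_source(6),
    lambda s: moving_average(s, 3),
    lambda s: scale(s, 2.0),
))
print(result)
```

Each stage only sees the stream handed to it, so stages can be rearranged or replaced without touching the others.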

Page 72: Stream Processing

GPU

A GPU is a single-chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn

It is used primarily for 3D applications

A GPU can be present on a video card, on the motherboard, or, in certain CPUs, on the CPU die

Page 73: Stream Processing

World’s First GPU

In 1999 Nvidia marketed the GeForce 256 as "the world's first GPU", a single-chip processor capable of processing a minimum of 10 million polygons per second.

Rival ATI Technologies coined the term visual processing unit (VPU) with the release of the Radeon 9700 in 2002.

Page 74: Stream Processing

GPUs have a very high compute capacity

Page 76: Stream Processing

To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers

Page 77: Stream Processing

To the software, the accelerator is another computer to which your program sends data and routines to execute

Page 78: Stream Processing

GPGPU

This concept turns the massive floating-point computational power of a modern graphics accelerator into general-purpose computing power

Page 79: Stream Processing

GPUs can only process independent vertices and fragments, but can process many of them in parallel

Page 80: Stream Processing

GPGPU applications need to have high arithmetic intensity
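A rough way to quantify arithmetic intensity is to divide the number of arithmetic operations by the amount of data moved. The sketch below assumes that definition (floating-point operations per byte), 4-byte floats, and idealized memory traffic in which each input and output is moved exactly once. The function names are hypothetical and the figures are illustrative, not measurements.

```python
BYTES_PER_FLOAT = 4  # assumption: single-precision values


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Floating-point operations performed per byte of data moved."""
    return flops / bytes_moved


def saxpy_intensity(n: int) -> float:
    """y[i] = a * x[i] + y[i] for n elements.

    2 flops per element; each element reads x, reads y, and writes y.
    """
    return arithmetic_intensity(2 * n, 3 * n * BYTES_PER_FLOAT)


def matmul_intensity(n: int) -> float:
    """C = A * B for n-by-n matrices.

    About 2 * n**3 flops; ideally A and B are read once and C written once.
    """
    return arithmetic_intensity(2 * n ** 3, 3 * n * n * BYTES_PER_FLOAT)


low = saxpy_intensity(1_000_000)
high = matmul_intensity(1024)
print(f"SAXPY: {low:.3f} flops/byte, matmul: {high:.1f} flops/byte")
```

Under these assumptions the vector update stays near one sixth of a flop per byte regardless of size, while matrix multiplication grows with the matrix dimension, which is why the latter is the better fit for a GPU.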

Page 81: Stream Processing

GPUs are stream processors – processors that can operate in parallel by running a single kernel on many records in a stream at once

Page 82: Stream Processing

Ideal GPGPU applications have large data sets, high parallelism, and minimal dependency between data elements
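The "single kernel on many records" model can be sketched as a launch that runs one function once per element index. This is a CPU-side illustration using Python's standard thread pool, not GPU code: `saxpy_kernel` and `launch` are hypothetical names, and Python threads will not give the speedup a real GPU would. The point is only that each invocation depends on its own index and nothing else, so the invocations can be scheduled in any order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def saxpy_kernel(i: int, a: float, x: Sequence[float], y: Sequence[float]) -> float:
    """Kernel for one record: uses only element i of each input."""
    return a * x[i] + y[i]


def launch(kernel: Callable, n: int, *args, workers: int = 4) -> List[float]:
    """Run kernel(i, *args) for every i in range(n) on a pool of workers.

    Results are returned in index order even if the calls finish out of order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: kernel(i, *args), range(n)))


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = launch(saxpy_kernel, len(x), 2.0, x, y)
print(out)
```

A computation where element i needs the result of element i-1, such as a running sum, does not fit this pattern directly, which is the "minimal dependency between data elements" requirement.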

Page 83: Stream Processing

In certain circumstances the GPU calculates forty times faster than conventional CPUs

Page 87: Stream Processing

AMD Athlon 64 X2 CPU: 154 m

ATI X1950 XTX GPU: 384 m

Intel Core 2 Quad CPU: 582 m

NVIDIA G8800 GTX GPU: 680 m

Page 88: Stream Processing

“The processing power of just 5,000 ATI processors is also enough to rival that of the existing 200,000 computers currently involved in the Folding@home project”

[Ref 1]

Page 89: Stream Processing

“...it is estimated that if a mere 10,000 computers were to each use an ATI processor to conduct folding research, that the Folding@home program would effectively perform faster than the fastest supercomputer in existence today, surpassing the 1 petaFLOP level” (2007)

November 10, 2011: Folding@home at 6.0 petaFLOPS, versus 8.162 petaFLOPS for the K computer [Ref 1]
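The figures quoted on these slides imply some per-GPU numbers that are easy to work out. The short calculation below uses only the numbers stated above; the variable names are hypothetical, and the results are back-of-the-envelope estimates, not measured performance.

```python
PETA = 1e15
GIGA = 1e9

# Second quote: 10,000 GPUs together would surpass 1 petaFLOP.
per_gpu_gflops = (1 * PETA / 10_000) / GIGA

# First quote: 5,000 GPUs rival the 200,000 computers in the project.
computers_per_gpu = 200_000 / 5_000

# 2011 figures: Folding@home throughput relative to the K computer.
fah_vs_k = 6.0 / 8.162

print(f"Implied throughput per GPU: {per_gpu_gflops:.0f} GFLOPS")
print(f"Computers matched by one GPU: {computers_per_gpu:.0f}")
print(f"Folding@home relative to K computer: {fah_vs_k:.2f}")
```

The ratio of one GPU to roughly forty computers is consistent with the "forty times faster" claim made earlier in the slides.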

Page 90: Stream Processing

Comparing GPUs to CPUs isn't an apples-to-apples comparison:

The clock rates are lower

The architectures are radically different

The problems they're trying to solve are almost completely unrelated

Page 92: Stream Processing

Application Processor:

Executes application code like MPEG decoding

Sequences the instructions and issues them to the stream clients, e.g., the KEU and the DRAM interface

[Kapasi 2003]

Page 93: Stream Processing

Two Stream Clients:

KEU: Programmable Kernel Execution Unit

DRAM interface:

Provides access to global data storage

[Kapasi 2003]

Page 94: Stream Processing

KEU:

It has two stream level instructions:

1. load_kernel – loads a compiled kernel function into the local instruction storage inside the KEU

2. run_kernel – executes the kernel

[Kapasi 2003]

Page 95: Stream Processing

DRAM interface:

Two stream-level instructions as well:

1. load_stream – loads an entire stream from DRAM into the SRF

2. store_stream – stores an entire stream from the SRF back to DRAM

[Kapasi 2003]
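The four stream-level instructions named on the last two slides can be put together in a toy model. This is a sketch, not the actual Imagine instruction set or programming interface: the class `ToyStreamProcessor`, its argument names, and the dictionary-based storage are all assumptions made for illustration. It also assumes that load_stream moves a stream from DRAM into the SRF and that store_stream moves it back, following the description in [Kapasi 2003].

```python
from typing import Callable, Dict, List


class ToyStreamProcessor:
    """Toy model of an application processor issuing stream instructions."""

    def __init__(self, dram: Dict[str, List[int]]):
        self.dram = dict(dram)  # off-chip global storage: name -> stream
        self.srf: Dict[str, List[int]] = {}   # stream register file
        self.kernels: Dict[str, Callable] = {}  # KEU instruction storage

    # DRAM interface instructions
    def load_stream(self, name: str) -> None:
        """Copy an entire stream from DRAM into the SRF."""
        self.srf[name] = list(self.dram[name])

    def store_stream(self, name: str) -> None:
        """Copy an entire stream from the SRF back to DRAM."""
        self.dram[name] = list(self.srf[name])

    # KEU instructions
    def load_kernel(self, name: str, fn: Callable) -> None:
        """Place a compiled kernel in the KEU's local instruction storage."""
        self.kernels[name] = fn

    def run_kernel(self, name: str, src: str, dst: str) -> None:
        """Run a loaded kernel over one SRF stream, writing another."""
        kernel = self.kernels[name]
        self.srf[dst] = [kernel(record) for record in self.srf[src]]


sp = ToyStreamProcessor({"samples": [1, 2, 3, 4]})
sp.load_stream("samples")
sp.load_kernel("square", lambda v: v * v)
sp.run_kernel("square", src="samples", dst="squares")
sp.store_stream("squares")
print(sp.dram["squares"])
```

In this model the kernel only ever touches data in the SRF, so DRAM is accessed once to bring the input in and once to write the result out.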

Page 96: Stream Processing

Local register files (LRFs)

1. hold operands for arithmetic operations (similar to caches on CPUs)

2. exploit fine-grain locality

[Kapasi 2003]

Page 97: Stream Processing

Stream register files (SRFs)

1. capture coarse-grain locality

2. efficiently transfer data to and from the LRFs

[Kapasi 2003]

Page 98: Stream Processing

[Kapasi 2003]

Page 99: Stream Processing

Topics learnt today:

1. Stream Processing

2. StreamIt language from MIT

3. How modern GPUs use stream processing

4. Imagine Stream Processor from Stanford

Page 100: Stream Processing

References:

1. http://folding.stanford.edu/

2. Kapasi, U. J., Rixner, S., Dally, W. J., Khailany, B., Ahn, J. H., Mattson, P., and Owens, J. D., “Programmable Stream Processors,” IEEE Computer, August 2003.

3. Thies, W., Karczmarek, M., and Amarasinghe, S., “StreamIt: A Language for Streaming Applications,” in Proc. Intl. Conf. on Compiler Construction, 2002.