TRANSCRIPT
Implementing Boolean matrix multiplication on a GPU
Alexander Okhotin
Department of Mathematics, University of Turku, Finland / Academy of Finland
DESY, Hamburg, Germany, 12 April 2010
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 1 / 18
Background

High-performance hardware is parallel.
Most algorithms are (partially) sequential.
Find the bottleneck and parallelize it.
The speaker's case: syntax analysis for general context-free grammars.
  - Sequential in nature.
  - Typically implemented combinatorially.
  - Can be done via Boolean matrix multiplication.
    - Valiant (1975): theoretical bound.
    - Okhotin (2010): refactored and generalized.
  - Efficiently parallelized.
Implementing on a Graphics Processing Unit.
Part I
GPU programming
Graphics Processing Units

Designed for 3D graphics in computer games:
  - Shading.
  - Texturing.
  - Per-pixel effects.
  - The same function for each pixel.
    - Function as a kernel (program).
    - Pixel as a work item.
General-purpose computation on GPUs:
  - Tens of cores, each with multiple ALUs.
  - Approaching 1 teraflop.
  - Priced as a consumer toy.
Best price-to-performance ratio.
Special programming techniques.
GPU programming

Proprietary interfaces: NVIDIA CUDA, ATI Stream.
Device-independent language: OpenCL.
  - Supported by NVIDIA and ATI drivers.
  - CPU implementation.
Kernel: a program running on the GPU.
  - A dialect of C.
  - Computes one "work item".
  - Executed for a grid of work items.
Host code running on the CPU:
  - Allocate GPU memory.
  - Load and compile a kernel.
  - Pass arguments to the kernel.
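The kernel/work-item split above can be sketched on the CPU. This is a minimal emulation for illustration only, not the actual OpenCL API: the hypothetical `run_grid` driver plays the role of the host code enqueueing a kernel over a 1D grid (real host code would go through calls such as `clBuildProgram`, `clSetKernelArg` and `clEnqueueNDRangeKernel`).

```c
#include <stddef.h>

/* A "kernel" in the OpenCL sense: the same function runs once per
 * work item.  Here one work item doubles one array element. */
static void double_kernel(size_t gid, const float *in, float *out) {
    out[gid] = 2.0f * in[gid];   /* one work item = one element */
}

/* Host-side driver (hypothetical): "enqueue" the kernel over a 1D
 * grid of n work items.  On a GPU these calls would run in parallel;
 * here they run sequentially, which is the only difference visible
 * to a data-parallel kernel with independent work items. */
static void run_grid(size_t n, const float *in, float *out) {
    for (size_t gid = 0; gid < n; gid++)
        double_kernel(gid, in, out);
}
```

The key property the emulation preserves is that work items are independent, so their execution order does not matter.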
Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.
Many threads run on each core, grouped into warps.
Main system memory ("host memory"): accessed through the bus.
Global memory: accessed by all GPU cores (up to 150 GB/s).
  - 64–512-bit bus.
  - Multiple threads should preferably access adjacent words.
Local memory: shared by all threads on a core.
  - Much faster.
  - Often used to cache data.
Private memory: owned by a single thread.
Computation is divided into work items.
  - 1D, 2D or 3D grid of work items.
  - A block of work items: a work group.
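The "adjacent words" advice follows directly from row-major array layout. A small sketch (the helper `idx` is mine, not from the talk) shows why mapping the column index to the fastest-varying grid dimension makes a warp's accesses contiguous:

```c
#include <stddef.h>

/* Row-major index of element (row, col) in a matrix with n columns. */
static size_t idx(size_t row, size_t col, size_t n) {
    return row * n + col;
}

/* Work items that differ only in `col` touch consecutive words, so a
 * warp's loads can be coalesced into one wide memory transaction over
 * the 64-512-bit bus.  Work items that differ only in `row` touch
 * words n apart (stride-n), forcing separate transactions. */
```

So when laying out a 2D computation, the dimension that varies fastest across neighbouring work items should index the innermost (contiguous) array dimension.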
Primitive example

Example (Jacobi method)
1. Compile the program.
2. Allocate n*n*sizeof(float) bytes for A and B.
3. Create the kernel with arguments (n, n, A, B).
4. Invoke with work items {0, ..., n−3} × {0, ..., n−3}.
5. Wait for termination.

It works... though very inefficiently:
  - Each value is read 4 times.
  - Memory alignment is ignored.
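The transcript does not show the kernel body, so the following is a sketch under the usual assumption of the 4-point Jacobi stencil: work item (i, j), with 0 ≤ i, j ≤ n−3, updates interior cell (i+1, j+1) of the n-by-n matrix A into B.

```c
#include <stddef.h>

/* One work item of the Jacobi example (assumed 4-point stencil; the
 * actual kernel body is not in the transcript). */
static void jacobi_item(size_t i, size_t j, size_t n,
                        const float *A, float *B) {
    size_t r = i + 1, c = j + 1;   /* the interior cell this item owns */
    B[r*n + c] = 0.25f * (A[(r-1)*n + c] + A[(r+1)*n + c] +
                          A[r*n + (c-1)] + A[r*n + (c+1)]);
    /* Each neighbour value is also read by up to 3 other work items:
     * the "read 4 times" inefficiency noted on the slide.  Caching a
     * work group's tile in local memory would fix this. */
}
```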
Part II
Boolean matrix multiplication
Matrix multiplication as such

S: a semiring.
A ∈ S^{m×ℓ}, B ∈ S^{ℓ×n}.
Their product, C ∈ S^{m×n}:

    C_{i,j} = Σ_{k=1}^{ℓ} A_{i,k} · B_{k,j}

ℓmn multiplications, (ℓ−1)mn additions.
In this talk:
  - S: {0, 1} = B;
  - sum: disjunction;
  - product: conjunction;
  - square matrices: m = n = ℓ.
Θ(n³) bit operations.
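As a baseline, the definition above translates directly into a CPU reference implementation over the Boolean semiring, with one entry per byte and Θ(n³) bit operations (a serious GPU version would instead pack 32 entries per machine word; this sketch only fixes the semantics):

```c
#include <stdint.h>
#include <stddef.h>

/* Naive Boolean matrix product over ({0,1}, OR, AND):
 * C[i][j] = OR over k of (A[i][k] AND B[k][j]).
 * Matrices are n-by-n, row-major, one 0/1 entry per byte. */
static void bool_matmul(size_t n, const uint8_t *A,
                        const uint8_t *B, uint8_t *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            uint8_t c = 0;
            /* disjunction short-circuits: stop once a witness k is found */
            for (size_t k = 0; k < n && !c; k++)
                c = (uint8_t)(A[i*n + k] & B[k*n + j]);
            C[i*n + j] = c;
        }
}
```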
Fast matrix multiplication over a ring

How many multiplications are needed for 2×2 matrices? Eight:

    ( a11 a12 )   ( b11 b12 )   ( a11·b11 + a12·b21   a11·b12 + a12·b22 )
    ( a21 a22 ) × ( b21 b22 ) = ( a21·b11 + a22·b21   a21·b12 + a22·b22 )

Assume S is a ring:
- ∀x ∈ S ∃(−x) ∈ S: x + (−x) = 0.

Strassen (1969): 2×2 matrices using 7 multiplications.
- First, compute 14 linear combinations of the entries.
- Second, calculate their 7 products.
- Linear combinations of those products yield the result.
- Larger matrices: treat them as 2×2 block matrices

      ( A11 A12 )   ( B11 B12 )
      ( A21 A22 ) × ( B21 B22 )

  and recurse.
- O(n^{log₂ 7}) operations for n×n matrices.

Coppersmith and Winograd (1990): O(n^{2.376}) operations.

But (B, ∨, ∧) is not a ring.
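Strassen's 2×2 scheme can be written out explicitly. The seven products m1..m7 and the output combinations below are the standard published formulas, here applied over the ordinary integers; the function name is illustrative.

```python
def strassen_2x2(A, B):
    # Strassen's scheme: 7 multiplications instead of 8,
    # at the cost of extra additions and subtractions --
    # which is exactly why S must be a ring (negation needed).
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Applied recursively to 2×2 block matrices, this gives the O(n^{log₂ 7}) bound.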
Applying fast matrix multiplication to the Boolean semiring

Take n×n Boolean matrices and multiply them in Z_{n+1}:

    ( 1 0 )   ( 0 1 )   ( 0 1 )            ( 0 1 )
    ( 1 1 ) × ( 1 1 ) = ( 1 2 )  in Z_3  →  ( 1 1 )  in B

Each entry of the product is a sum of at most n terms, so it never wraps
around in Z_{n+1}; mapping every nonzero entry to 1 recovers the Boolean product.

One bit → ⌈log(n+1)⌉ bits.
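The reduction is small enough to state in full: multiply with ordinary arithmetic, then collapse nonzero entries. The helper name is illustrative; any fast ring multiplication (e.g. Strassen's) could replace the naive triple loop.

```python
def bool_matmul_via_ring(A, B):
    # Multiply 0/1 matrices with integer arithmetic; entries of the
    # product never exceed n, so working modulo n+1 (as on the slide)
    # loses no information.  Nonzero entries map back to 1.
    n = len(A)
    C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return [[1 if c else 0 for c in row] for row in C]
```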
An O(n³ / log n) method for Boolean matrices

Arlazarov et al. (1970):
- Fix k ≪ n.
- Multiply 1×k blocks of A by k×n blocks of B.
- There are at most 2^k different 1×k blocks.
- Pre-compute all 2^k products for each k×n block of B (there are n/k blocks).
- Each 1×k block of A then costs one lookup of n bits.

Time complexity:

    2^k · (n/k) · n   (making the tables)   +   n³/k   (multiplication)

≈ 2n³ / log n operations for k = log n.
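A compact way to see the method in action: store each row as a Python integer used as a bitset, so that the "union of rows" is a single OR. This is a sketch under that representation (function name and the subset-enumeration trick are illustrative, not from the talk):

```python
def four_russians(A, B, k=4):
    # A, B: rows as int bitsets (bit j = column j).  For each band of
    # k rows of B, precompute the OR of every subset of its rows;
    # a 1 x k block of A then costs a single table lookup.
    n = len(A)
    C = [0] * n
    for start in range(0, n, k):
        band = B[start:start + k]
        # T[mask] = OR of band[j] over all set bits j of mask,
        # built incrementally: strip the lowest set bit.
        T = [0] * (1 << len(band))
        for mask in range(1, len(T)):
            low = mask & -mask
            T[mask] = T[mask ^ low] | band[low.bit_length() - 1]
        for i in range(n):
            idx = (A[i] >> start) & ((1 << len(band)) - 1)
            C[i] |= T[idx]
    return C
```

Building each table costs one OR per entry, matching the 2^k · (n/k) · n term of the slide's complexity bound.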
Part III
Boolean matrix multiplication on a GPU
Joint work with Christian Reitwießner (Würzburg)
Main performance considerations

Matrices A, B ∈ B^{n×n} start out on the CPU:
- either multiply them on the CPU,
- or send them to the GPU (and use which method there?).

If n < 200, it is faster to multiply than to transfer.
If n > 50000, the matrices will not fit on the GPU.
- Process them by parts.

Direct n³ multiplication:
- already superseded for n > 100.

Arlazarov et al.: n³ / log n operations.
- Basic operation: union of rows.
- Works well on a GPU.

Strassen's method: O(n^{log₂ 7}).
- Have to multiply ints instead of bits!
- Inductive on n, reducing to many small matrices.
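These considerations amount to a small dispatcher. A sketch, with the caveat that the cutoffs 200 and 50000 are the talk's measured guidelines for one particular setup, not universal constants, and the return strings are purely illustrative:

```python
def choose_method(n):
    # Hypothetical dispatcher reflecting the thresholds on the slide.
    if n < 200:
        return "cpu"          # transfer to the GPU costs more than it saves
    if n > 50000:
        return "gpu-by-parts" # matrices exceed GPU memory; process in blocks
    return "gpu"              # e.g. the Arlazarov et al. kernel
```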
The O(n³ / log n) method on a GPU: making a table for B

Matrix B ∈ B^{n×n} is on the GPU.
For each block of lines i ∈ {0, ..., n/k − 1}, create a table T[i] ∈ B^{2^k × n}.
Line (b_{k−1} ... b_1 b_0)₂ of T[i]: the disjunction of all lines j of the block with b_j = 1.
Work items: every 64 bits in each line.
- 2^k disjunctions of longs.
- Threads access adjacent words.
Another work-item dimension: T[i] for different i.
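What a single work item computes can be isolated: one 64-bit word of one table line. A Python sketch (the function name is illustrative; on the GPU this runs once per (line, word) pair, with neighbouring threads touching adjacent words):

```python
def table_entry(band_words, entry):
    # band_words[j]: the j-th line's 64-bit word at this word position.
    # entry = (b_{k-1} ... b_1 b_0)_2 selects which lines to OR together.
    word = 0
    for j, row_word in enumerate(band_words):
        if (entry >> j) & 1:
            word |= row_word
    return word
```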
The O(n³ / log n) method on a GPU: multiplying the matrices

Matrix A ∈ B^{n×n} is on the GPU, together with the n/k tables T[i] ∈ B^{2^k × n}.
Compute the product A × B.
Work items: lines of A (and of C).
Step 1: cache the line of A in local memory.
The block-column of A determines the number i of the table;
the 1×k block of A indexes into T[i];
the looked-up line is OR-ed into the line of C.
Second work-item dimension: every 64 bits in each line of T and C.
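The lookup phase, seen from one work item of the first dimension (one line of A producing one line of C), can be sketched as follows; rows are ints used as bitsets, the name `multiply_line` is illustrative, and the local-memory caching of the real kernel has no Python counterpart:

```python
def multiply_line(a_bits, tables, n, k):
    # a_bits: one line of A as an int bitset; tables[i] = T[i], the
    # 2^k precomputed row-unions for the i-th k x n block of B.
    # Each 1 x k block of A is one table lookup, OR-ed into the line of C.
    c = 0
    for i in range(n // k):
        idx = (a_bits >> (i * k)) & ((1 << k) - 1)
        c |= tables[i][idx]
    return c
```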
Performance

n = 2048, k = 8.

                    CPU       Nvidia G210M           Nvidia GTS250
                              (low-end laptop GPU)   (average gaming card)
    Time            234 ms    17.4 ms                3.3 ms
    Memory access             9.4 GB/s               51.9 GB/s

Basically, bandwidth-limited.
The cores could compute more!
Optimization: cache more in local memory.
Local memory: usually 16 KB per core.
Compute the table by parts.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 17 / 18
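A back-of-the-envelope check of the bandwidth-limited claim, assuming the dominant traffic is the table lookups: for each of the n rows of A, n/k table rows of n bits each are read from global memory (this traffic model is an assumption, not a figure from the talk).

```python
# Lower bound on running time from memory traffic alone.
n, k = 2048, 8
traffic_bytes = n * (n // k) * (n // 8)   # n rows, n/k lookups of n bits
gts250_bw = 51.9e9                        # measured bandwidth from the table, B/s
lower_bound_ms = traffic_bytes / gts250_bw * 1e3
print(f"{traffic_bytes / 2**20:.0f} MiB per product, >= {lower_bound_ms:.2f} ms")
```

This gives 128 MiB of reads per product and a lower bound of about 2.6 ms on the GTS250, close to the measured 3.3 ms, which is consistent with the kernel being limited by memory bandwidth rather than arithmetic.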
Future work in this project

1 Refactor to use more local memory.
2 Implement multiplication of huge matrices.
3 Do a practical comparison with Strassen.
4 For the parsing application: better performance on smaller matrices.
  Large matrices are handled fast enough.
  128×128 and 256×256 matrices dominate the running time.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 18 / 18