TRANSCRIPT
Implementing Boolean matrix multiplication on a GPU
Alexander Okhotin
Department of Mathematics, University of Turku, Finland / Academy of Finland
DESY, Hamburg, Germany, 12 April 2010
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 1 / 18
Background

High-performance hardware is parallel.
Most algorithms are (partially) sequential.
Find the bottleneck and parallelize it.
The speaker's case: syntax analysis for general context-free grammars.
  - Sequential in nature.
  - Typically implemented combinatorially.
  - Can be done via Boolean matrix multiplication.
    - Valiant (1975): theoretical bound.
    - Okhotin (2010): refactored and generalized.
  - Efficiently parallelized.
Implementing on a Graphics Processing Unit.
Part I
GPU programming
Graphics Processing Units

Designed for 3D graphics in computer games:
  - Shading.
  - Texturing.
  - Per-pixel effects.
  - The same function for each pixel.
    - Function as a kernel (program).
    - Pixel as a work item.
General-purpose computation on GPUs:
  - Tens of cores, each with multiple ALUs.
  - Approaching 1 teraflop.
  - Priced as a consumer toy.
Best price-to-performance ratio.
Special programming techniques.
GPU programming

Proprietary interfaces: NVIDIA CUDA, ATI Stream.
Device-independent language: OpenCL.
  - Supported by NVIDIA and ATI drivers.
  - CPU implementation.
Kernel: a program running on the GPU.
  - A dialect of C.
  - Computes one "work item".
  - Executed for a grid of work items.
Host code running on the CPU:
  - Allocate GPU memory.
  - Load and compile a kernel.
  - Pass arguments to the kernel.
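The kernel/work-item split above can be sketched on the CPU. This is a minimal emulation for illustration only, not the actual OpenCL API: the hypothetical `run_grid` driver plays the role of the host code enqueueing a kernel over a 1D grid (real host code would go through calls such as `clBuildProgram`, `clSetKernelArg` and `clEnqueueNDRangeKernel`).

```c
#include <stddef.h>

/* A "kernel" in the OpenCL sense: the same function runs once per
 * work item.  Here one work item doubles one array element. */
static void double_kernel(size_t gid, const float *in, float *out) {
    out[gid] = 2.0f * in[gid];   /* one work item = one element */
}

/* Host-side driver (hypothetical): "enqueue" the kernel over a 1D
 * grid of n work items.  On a GPU these calls would run in parallel;
 * here they run sequentially, which is the only difference visible
 * to a data-parallel kernel with independent work items. */
static void run_grid(size_t n, const float *in, float *out) {
    for (size_t gid = 0; gid < n; gid++)
        double_kernel(gid, in, out);
}
```

The key property the emulation preserves is that work items are independent, so their execution order does not matter.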
Execution and memory model

2–32 multithreaded cores, each with 8–16 ALUs.
Many threads run on each core, grouped into warps.
Main system memory ("host memory"): accessed through the bus.
Global memory: accessed by all GPU cores (up to 150 GB/s).
  - 64–512-bit bus.
  - Multiple threads should preferably access adjacent words.
Local memory: shared by all threads on a core.
  - Much faster.
  - Often used to cache data.
Private memory: owned by a single thread.
Computation is divided into work items.
  - 1D, 2D or 3D grid of work items.
  - A block of work items: a work group.
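The "adjacent words" advice follows directly from row-major array layout. A small sketch (the helper `idx` is mine, not from the talk) shows why mapping the column index to the fastest-varying grid dimension makes a warp's accesses contiguous:

```c
#include <stddef.h>

/* Row-major index of element (row, col) in a matrix with n columns. */
static size_t idx(size_t row, size_t col, size_t n) {
    return row * n + col;
}

/* Work items that differ only in `col` touch consecutive words, so a
 * warp's loads can be coalesced into one wide memory transaction over
 * the 64-512-bit bus.  Work items that differ only in `row` touch
 * words n apart (stride-n), forcing separate transactions. */
```

So when laying out a 2D computation, the dimension that varies fastest across neighbouring work items should index the innermost (contiguous) array dimension.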
Primitive example

Example (Jacobi method)
1. Compile the program.
2. Allocate n*n*sizeof(float) bytes for A and B.
3. Create the kernel with arguments (n, n, A, B).
4. Invoke with work items {0, ..., n−3} × {0, ..., n−3}.
5. Wait for termination.

It works... though very inefficiently:
  - Each value is read 4 times.
  - Memory alignment is ignored.
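The transcript does not show the kernel body, so the following is a sketch under the usual assumption of the 4-point Jacobi stencil: work item (i, j), with 0 ≤ i, j ≤ n−3, updates interior cell (i+1, j+1) of the n-by-n matrix A into B.

```c
#include <stddef.h>

/* One work item of the Jacobi example (assumed 4-point stencil; the
 * actual kernel body is not in the transcript). */
static void jacobi_item(size_t i, size_t j, size_t n,
                        const float *A, float *B) {
    size_t r = i + 1, c = j + 1;   /* the interior cell this item owns */
    B[r*n + c] = 0.25f * (A[(r-1)*n + c] + A[(r+1)*n + c] +
                          A[r*n + (c-1)] + A[r*n + (c+1)]);
    /* Each neighbour value is also read by up to 3 other work items:
     * the "read 4 times" inefficiency noted on the slide.  Caching a
     * work group's tile in local memory would fix this. */
}
```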
Part II
Boolean matrix multiplication
Matrix multiplication as such

S: a semiring.
A ∈ S^{m×ℓ}, B ∈ S^{ℓ×n}.
Their product, C ∈ S^{m×n}:

    C_{i,j} = Σ_{k=1}^{ℓ} A_{i,k} · B_{k,j}

ℓmn multiplications, (ℓ−1)mn additions.
In this talk:
  - S: {0, 1} = B;
  - sum: disjunction;
  - product: conjunction;
  - square matrices: m = n = ℓ.
Θ(n³) bit operations.
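As a baseline, the definition above translates directly into a CPU reference implementation over the Boolean semiring, with one entry per byte and Θ(n³) bit operations (a serious GPU version would instead pack 32 entries per machine word; this sketch only fixes the semantics):

```c
#include <stdint.h>
#include <stddef.h>

/* Naive Boolean matrix product over ({0,1}, OR, AND):
 * C[i][j] = OR over k of (A[i][k] AND B[k][j]).
 * Matrices are n-by-n, row-major, one 0/1 entry per byte. */
static void bool_matmul(size_t n, const uint8_t *A,
                        const uint8_t *B, uint8_t *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            uint8_t c = 0;
            /* disjunction short-circuits: stop once a witness k is found */
            for (size_t k = 0; k < n && !c; k++)
                c = (uint8_t)(A[i*n + k] & B[k*n + j]);
            C[i*n + j] = c;
        }
}
```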
Fast matrix multiplication over a ring

How many multiplications are needed for 2×2 matrices? Eight:

    ( a11 a12 )   ( b11 b12 )   ( a11·b11 + a12·b21   a11·b12 + a12·b22 )
    ( a21 a22 ) × ( b21 b22 ) = ( a21·b11 + a22·b21   a21·b12 + a22·b22 )

Assume S is a ring:
- ∀x ∈ S ∃(−x) ∈ S: x + (−x) = 0.

Strassen (1969): 2×2 matrices using 7 multiplications.
- First, compute 14 linear combinations of the entries.
- Second, calculate their 7 products.
- Linear combinations of those products yield the result.
- Larger matrices: treat them as 2×2 block matrices

      ( A11 A12 )   ( B11 B12 )
      ( A21 A22 ) × ( B21 B22 )

  and recurse.
- O(n^{log₂ 7}) operations for n×n matrices.

Coppersmith and Winograd (1990): O(n^{2.376}) operations.

But (B, ∨, ∧) is not a ring.
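Strassen's 2×2 scheme can be written out explicitly. The seven products m1..m7 and the output combinations below are the standard published formulas, here applied over the ordinary integers; the function name is illustrative.

```python
def strassen_2x2(A, B):
    # Strassen's scheme: 7 multiplications instead of 8,
    # at the cost of extra additions and subtractions --
    # which is exactly why S must be a ring (negation needed).
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]
```

Applied recursively to 2×2 block matrices, this gives the O(n^{log₂ 7}) bound.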
Applying fast matrix multiplication to the Boolean semiring

Take n×n Boolean matrices and multiply them in Z_{n+1}:

    ( 1 0 )   ( 0 1 )   ( 0 1 )            ( 0 1 )
    ( 1 1 ) × ( 1 1 ) = ( 1 2 )  in Z_3  →  ( 1 1 )  in B

Each entry of the product is a sum of at most n terms, so it never wraps
around in Z_{n+1}; mapping every nonzero entry to 1 recovers the Boolean product.

One bit → ⌈log(n+1)⌉ bits.
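The reduction is small enough to state in full: multiply with ordinary arithmetic, then collapse nonzero entries. The helper name is illustrative; any fast ring multiplication (e.g. Strassen's) could replace the naive triple loop.

```python
def bool_matmul_via_ring(A, B):
    # Multiply 0/1 matrices with integer arithmetic; entries of the
    # product never exceed n, so working modulo n+1 (as on the slide)
    # loses no information.  Nonzero entries map back to 1.
    n = len(A)
    C = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return [[1 if c else 0 for c in row] for row in C]
```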
An O(n³ / log n) method for Boolean matrices

Arlazarov et al. (1970):
- Fix k ≪ n.
- Multiply 1×k blocks of A by k×n blocks of B.
- There are at most 2^k different 1×k blocks.
- Pre-compute all 2^k products for each k×n block of B (there are n/k blocks).
- Each 1×k block of A then costs one lookup of n bits.

Time complexity:

    2^k · (n/k) · n   (making the tables)   +   n³/k   (multiplication)

≈ 2n³ / log n operations for k = log n.
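A compact way to see the method in action: store each row as a Python integer used as a bitset, so that the "union of rows" is a single OR. This is a sketch under that representation (function name and the subset-enumeration trick are illustrative, not from the talk):

```python
def four_russians(A, B, k=4):
    # A, B: rows as int bitsets (bit j = column j).  For each band of
    # k rows of B, precompute the OR of every subset of its rows;
    # a 1 x k block of A then costs a single table lookup.
    n = len(A)
    C = [0] * n
    for start in range(0, n, k):
        band = B[start:start + k]
        # T[mask] = OR of band[j] over all set bits j of mask,
        # built incrementally: strip the lowest set bit.
        T = [0] * (1 << len(band))
        for mask in range(1, len(T)):
            low = mask & -mask
            T[mask] = T[mask ^ low] | band[low.bit_length() - 1]
        for i in range(n):
            idx = (A[i] >> start) & ((1 << len(band)) - 1)
            C[i] |= T[idx]
    return C
```

Building each table costs one OR per entry, matching the 2^k · (n/k) · n term of the slide's complexity bound.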
Part III
Boolean matrix multiplication on a GPU
Joint work with Christian Reitwießner (Würzburg)
Main performance considerations

Matrices A, B ∈ B^{n×n} start out on the CPU:
- either multiply them on the CPU,
- or send them to the GPU (and use which method there?).

If n < 200, it is faster to multiply than to transfer.
If n > 50000, the matrices will not fit on the GPU.
- Process them by parts.

Direct n³ multiplication:
- already superseded for n > 100.

Arlazarov et al.: n³ / log n operations.
- Basic operation: union of rows.
- Works well on a GPU.

Strassen's method: O(n^{log₂ 7}).
- Have to multiply ints instead of bits!
- Inductive on n, reducing to many small matrices.
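These considerations amount to a small dispatcher. A sketch, with the caveat that the cutoffs 200 and 50000 are the talk's measured guidelines for one particular setup, not universal constants, and the return strings are purely illustrative:

```python
def choose_method(n):
    # Hypothetical dispatcher reflecting the thresholds on the slide.
    if n < 200:
        return "cpu"          # transfer to the GPU costs more than it saves
    if n > 50000:
        return "gpu-by-parts" # matrices exceed GPU memory; process in blocks
    return "gpu"              # e.g. the Arlazarov et al. kernel
```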
The O(n³ / log n) method on a GPU: making a table for B

Matrix B ∈ B^{n×n} is on the GPU.
For each block of lines i ∈ {0, ..., n/k − 1}, create a table T[i] ∈ B^{2^k × n}.
Line (b_{k−1} ... b_1 b_0)₂ of T[i]: the disjunction of all lines j of the block with b_j = 1.
Work items: every 64 bits in each line.
- 2^k disjunctions of longs.
- Threads access adjacent words.
Another work-item dimension: T[i] for different i.
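What a single work item computes can be isolated: one 64-bit word of one table line. A Python sketch (the function name is illustrative; on the GPU this runs once per (line, word) pair, with neighbouring threads touching adjacent words):

```python
def table_entry(band_words, entry):
    # band_words[j]: the j-th line's 64-bit word at this word position.
    # entry = (b_{k-1} ... b_1 b_0)_2 selects which lines to OR together.
    word = 0
    for j, row_word in enumerate(band_words):
        if (entry >> j) & 1:
            word |= row_word
    return word
```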
The O(n³ / log n) method on a GPU: multiplying the matrices

Matrix A ∈ B^{n×n} is on the GPU, together with the n/k tables T[i] ∈ B^{2^k × n}.
Compute the product A × B.
Work items: lines of A (and of C).
Step 1: cache the line of A in local memory.
The block-column of A determines the number i of the table;
the 1×k block of A indexes into T[i];
the looked-up line is OR-ed into the line of C.
Second work-item dimension: every 64 bits in each line of T and C.
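The lookup phase, seen from one work item of the first dimension (one line of A producing one line of C), can be sketched as follows; rows are ints used as bitsets, the name `multiply_line` is illustrative, and the local-memory caching of the real kernel has no Python counterpart:

```python
def multiply_line(a_bits, tables, n, k):
    # a_bits: one line of A as an int bitset; tables[i] = T[i], the
    # 2^k precomputed row-unions for the i-th k x n block of B.
    # Each 1 x k block of A is one table lookup, OR-ed into the line of C.
    c = 0
    for i in range(n // k):
        idx = (a_bits >> (i * k)) & ((1 << k) - 1)
        c |= tables[i][idx]
    return c
```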
Performance

n = 2048, k = 8.

                    CPU       Nvidia G210M           Nvidia GTS250
                              (low-end laptop GPU)   (average gaming card)
    Time            234 ms    17.4 ms                3.3 ms
    Memory access             9.4 GB/s               51.9 GB/s

Basically, bandwidth-limited.
The cores could compute more!
Optimization: cache more in local memory.
Local memory: usually 16 KB per core.
Compute the table by parts.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 17 / 18
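A back-of-the-envelope check of the bandwidth-limited claim, assuming the dominant traffic is the table lookups: for each of the n rows of A, n/k table rows of n bits each are read from global memory (this traffic model is an assumption, not a figure from the talk).

```python
# Lower bound on running time from memory traffic alone.
n, k = 2048, 8
traffic_bytes = n * (n // k) * (n // 8)   # n rows, n/k lookups of n bits
gts250_bw = 51.9e9                        # measured bandwidth from the table, B/s
lower_bound_ms = traffic_bytes / gts250_bw * 1e3
print(f"{traffic_bytes / 2**20:.0f} MiB per product, >= {lower_bound_ms:.2f} ms")
```

This gives 128 MiB of reads per product and a lower bound of about 2.6 ms on the GTS250, close to the measured 3.3 ms, which is consistent with the kernel being limited by memory bandwidth rather than arithmetic.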
Future work in this project

1 Refactor to use more local memory.
2 Implement multiplication of huge matrices.
3 Do a practical comparison with Strassen.
4 For the parsing application: better performance on smaller matrices.
  Large matrices are handled fast enough.
  128×128 and 256×256 matrices dominate the running time.
Alexander Okhotin Boolean matrix multiplication on a GPU Hamburg, 12.04.2010 18 / 18