strassen’s algorithm for tensor contraction quick tips

1
Strassen’s Algorithm for Tensor Contrac4on The University of Texas at Austin Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn Strassen’s Algorithm for Tensor Contraction May 2015 Nov 2015 Performance Experiments Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen’s algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first work to demonstrate how one can in practice speed up tensor contraction with Strassen’s algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen’s algorithm to a recent state-of-the-art TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to 1.3× speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup. M 0 := (A 00 +A 11 )(B 00 +B 11 ); M 1 := (A 10 +A 11 )B 00 ; M 2 := A 00 (B 01 B 11 ); M 3 := A 11 (B 10 B 00 ); M 4 := (A 00 +A 01 )B 11 ; M 5 := (A 10 A 00 )(B 00 +B 01 ); M 6 := (A 01 A 11 )(B 10 +B 11 ); C 00 += M 0 + M 3 M 4 + M 6 C 01 += M 2 + M 4 C 10 += M 1 + M 3 C 11 += M 0 M 1 + M 2 + M 5 M 0 := (A 00 +A 11 )(B 00 +B 11 ); C 00 += M 0 ; C 11 += M 0 ; M 1 := (A 10 +A 11 )B 00 ; C 10 += M 1 ; C 11 –= M 1 ; M 2 := A 00 (B 01 B 11 ); C 01 += M 2 ; C 11 += M 2 ; M 3 := A 11 (B 10 B 00 ); C 00 += M 3 ; C 10 += M 3 ; M 4 := (A 00 +A 01 )B 11 ; C 01 += M 4 ; C 00 –= M 4 ; M 5 := (A 10 A 00 )(B 00 +B 01 ); C 11 += M 5 ; M 6 := (A 01 A 11 )(B 10 +B 11 ); C 00 += M 6 ; C += AB; M := (X+Y)(V+W); C += M; D += M; High-performance GEMM / Strassen’s Algroithm Matrix Multiplication Tensor Contraction Implementa4on Varia4ons [1] Jianyu Huang, Tyler M. Smith, Greg H. Henry, and Robert A. van de Geijn. “Strassen’s Algorithm Reloaded.” In SC’16. [2] Jianyu Huang, Devin A. Matthews, and Robert A. van de Geijn, “Strassen’s Algorithm for Tensor Contraction.” arXiv:1704.03092 (2017). [3] Devin A. Matthews, “High-Performance Tensor Contraction without Transposition.” Accepted in SISC. [4] Paul Springer, and Paolo Bientinesi. “Design of a high-performance gemm-like tensor-tensor multiplication.” arXiv:1607.00145 (2016). [5] Field G. Van Zee, and Robert A. van de Geijn. “BLIS: A framework for rapidly instantiating BLAS functionality.” TOMS 41, no. 3 (2015): 14. q Naïve Strassen TC: A traditional implementation with temporary buffers. q AB Strassen TC: Integrate the addition of tensors into and . q ABC Strassen TC: AB Strassen TC. Additionally integrate the update of multiple submatrices of matrix representation of C in the micro-kernel. Tensors as Matrices: Block-ScaHer-Matrix View Ø Tensor: , with . “d” dimension is stride-1, other dimensions have increasing strides (8, 16). Ø Matrix: , with . Column “ac” dimension has stride of “c” (8x2=16). Row “d” dimension has is stirde-1. (i.e. M is row-major.) Blocking lets us use the constant stride when possible, and a scattered algorithm otherwise. Ø Matrix multiplication: A, B, and C are eventually partitioned into small, fixed-sized blocks for packing or microkernel, such as 4x4 (picture below), 8x4, 6x8, etc. Ø Tensor Contraction: The tensor blocks can be encoded into regular small matrix blocks whenever possible, so that no overhead is incurred. 8x2x4 8x8 q , store offset for each position in rows or columns. q , store stride for each block or zero for irregular blocks. The block scatter vectors help to utilize efficient SIMD vector load/store instructions for stride-one index, or vector gather/scatter fetch instructions for stride-n index. Performance of various implementations for synthetic data on single core and one socket. Top row: actual and modeled performance on single core; Bottom Row: actual performance on one socket. Left column: ; Middle column: varies; Right row: vary. Weak scalability performance result of the various implementations for a 4-D tensor contraction CCSD application on distributed memory: . CTF shows the performance of the Cyclops Tensor Framework (linked with Intel MKL). Performance for representative user cases of benchmark from [4]. TC is identified by the index string, with the tensor index bundle of each tensor in the order , e.g. is denoted as . Left: performance on single core. Right: performance on one socket. General Operation: M := (X+δY)(V+εW); C += γ 0 M; D += γ 1 M; γ 0 , γ 1 ,δ,ε {-1, 0, 1}. (a) Tensor contraction with , and . The relative location of each data element in memory is given assuming a generalized column-major layout. An example to illustrate Strassen's algorithm for tensor contraction. The red lines denotes Strassen’s algorithm partitions mapping from block scatter matrix view (bottom) to the original tensor (top). In this example the partitions are regular subtensors, but this is not required in general. (b) Block scatter matrix view of (a), where , and are mapped to matrices , and fs : and denote the scatter vectors; and denote the block scatter vectors. Element locations are given by the sum of the row and column scatter vector entries.

Upload: others

Post on 26-Apr-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Strassen’s Algorithm for Tensor Contraction QUICK TIPS

QUICK DESIGN GUIDE (--THIS SECTION DOES NOT PRINT--)

This PowerPoint 2007 template produces a 36x48 inch professional poster. You can use it to create your research poster and save valuable time placing titles, subtitles, text, and graphics. We provide a series of online tutorials that will guide you through the poster design process and answer your poster production questions. To view our template tutorials, go online to PosterPresentations.com and click on HELP DESK. When you are ready to print your poster, go online to PosterPresentations.com. Need Assistance? Call us at 1.866.649.3004

Object Placeholders

Using the placeholders To add text, click inside a placeholder on the poster and type or paste your text. To move a placeholder, click it once (to select it). Place your cursor on its frame, and your cursor will change to this symbol Click once and drag it to a new location where you can resize it. Section Header placeholder Click and drag this preformatted section header placeholder to the poster area to add another section header. Use section headers to separate topics or concepts within your presentation. Text placeholder Move this preformatted text placeholder to the poster to add a new body of text. Picture placeholder Move this graphic placeholder onto your poster, size it first, and then click it to add a picture to the poster.

Student discounts are available on our Facebook page. Go to PosterPresentations.com and click on the FB icon.

QUICK TIPS (--THIS SECTION DOES NOT PRINT--)

This PowerPoint template requires basic PowerPoint (version 2007 or newer) skills. Below is a list of commonly asked questions specific to this template. If you are using an older version of PowerPoint some template features may not work properly.

Template FAQs

Verifying the quality of your graphics Go to the VIEW menu and click on ZOOM to set your preferred magnification. This template is at 100% the size of the final poster. All text and graphics will be printed at 100% their size. To see what your poster will look like when printed, set the zoom to 100% and evaluate the quality of all your graphics before you submit your poster for printing. Modifying the layout This template has four different column layouts. Right-click your mouse on the background and click on LAYOUT to see the layout options. The columns in the provided layouts are fixed and cannot be moved but advanced users can modify any layout by going to VIEW and then SLIDE MASTER. Importing text and graphics from external sources TEXT: Paste or type your text into a pre-existing placeholder or drag in a new placeholder from the left side of the template. Move it anywhere as needed. PHOTOS: Drag in a picture placeholder, size it first, click in it and insert a photo from the menu. TABLES: You can copy and paste a table from an external document onto this poster template. To adjust the way the text fits within the cells of a table that has been pasted, right-click on the table, click FORMAT SHAPE then click on TEXT BOX and change the INTERNAL MARGIN values to 0.25. Modifying the color scheme To change the color scheme of this template go to the DESIGN menu and click on COLORS. You can choose from the provided color combinations or create your own.

©2013PosterPresenta/ons.com2117FourthStreet,[email protected]

Introduc4on

Strassen’sAlgorithmforTensorContrac4on

The University of Texas at Austin Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn

Strassen’s Algorithm for Tensor Contraction

May 2015 Nov 2015

PerformanceExperiments

Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen’s algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first work to demonstrate how one can in practice speed up tensor contraction with Strassen’s algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen’s algorithm to a recent state-of-the-art TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to 1.3× speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup. M0 := (A00+A11)(B00+B11);

M1 := (A10+A11)B00; M2 := A00(B01–B11); M3 := A11(B10–B00); M4 := (A00+A01)B11; M5 := (A10–A00)(B00+B01); M6 := (A01–A11)(B10+B11); C00 += M0 + M3 – M4 + M6 C01 += M2 + M4 C10 += M1 + M3 C11 += M0 – M1 + M2 + M5

M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0; M1 := (A10+A11)B00; C10 += M1; C11 –= M1; M2 := A00(B01–B11); C01 += M2; C11 += M2; M3 := A11(B10–B00); C00 += M3; C10 += M3; M4 := (A00+A01)B11; C01 += M4; C00 –= M4; M5 := (A10–A00)(B00+B01); C11 += M5; M6 := (A01–A11)(B10+B11); C00 += M6;

C += AB; M := (X+Y)(V+W); C += M; D += M;

High-performanceGEMM/Strassen’sAlgroithm

Matrix Multiplication Tensor Contraction

Implementa4onVaria4ons

[1] Jianyu Huang, Tyler M. Smith, Greg H. Henry, and Robert A. van de Geijn. “Strassen’s Algorithm Reloaded.” In SC’16. [2] Jianyu Huang, Devin A. Matthews, and Robert A. van de Geijn, “Strassen’s Algorithm for Tensor Contraction.” arXiv:1704.03092 (2017). [3] Devin A. Matthews, “High-Performance Tensor Contraction without Transposition.” Accepted in SISC. [4] Paul Springer, and Paolo Bientinesi. “Design of a high-performance gemm-like tensor-tensor multiplication.” arXiv:1607.00145 (2016). [5] Field G. Van Zee, and Robert A. van de Geijn. “BLIS: A framework for rapidly instantiating BLAS functionality.” TOMS 41, no. 3 (2015): 14.

q Naïve Strassen TC: A traditional implementation with temporary buffers. q AB Strassen TC: Integrate the addition of tensors into and . q ABC Strassen TC: AB Strassen TC. Additionally integrate the update of multiple

submatrices of matrix representation of C in the micro-kernel.

TensorsasMatrices:Block-ScaHer-MatrixViewØ  Tensor: , with . “d” dimension is stride-1, other dimensions have increasing strides (8, 16). Ø  Matrix: , with . Column “ac” dimension has stride of “c” (8x2=16). Row “d” dimension has is stirde-1. (i.e. M is row-major.) Blocking lets us use the constant stride when possible, and a scattered algorithm otherwise. Ø  Matrix multiplication: A, B, and C are eventually partitioned into small, fixed-sized blocks

for packing or microkernel, such as 4x4 (picture below), 8x4, 6x8, etc. Ø  Tensor Contraction: The tensor blocks can be encoded into regular small matrix blocks

whenever possible, so that no overhead is incurred.

8x2x4

8x8

q  , store offset for each position in rows or columns. q  , store stride for each block or zero for irregular blocks. The block scatter vectors help to utilize efficient SIMD vector load/store instructions for stride-one index, or vector gather/scatter fetch instructions for stride-n index.

Performance of various implementations for synthetic data on single core and one socket. Top row: actual and modeled performance on single core; Bottom Row: actual performance on one socket. Left column: ; Middle column: varies; Right row: vary.

Weak scalability performance result of the various implementations for a 4-D tensor contraction CCSD application on distributed memory: . CTF shows the performance of the Cyclops Tensor Framework (linked with Intel MKL).

Performance for representative user cases of benchmark from [4]. TC is identified by the index string, with the tensor index bundle of each tensor in the order , e.g. is denoted as . Left: performance on single core. Right: performance on one socket.

General Operation: M := (X+δY)(V+εW); C += γ0M; D += γ1M; γ0, γ1,δ,ε ∈{-1, 0, 1}. (a) Tensor contraction with , and . The relative location of each data element in memory is given assuming a generalized column-major layout.

An example to illustrate Strassen's algorithm for tensor contraction. The red lines denotes Strassen’s algorithm partitions mapping from block scatter matrix view (bottom) to the original tensor (top). In this example the partitions are regular subtensors, but this is not required in general.

(b) Block scatter matrix view of (a), where , and are mapped to matrices , and fs : and denote the scatter vectors; and denote the block scatter vectors. Element locations are given by the sum of the row and column scatter vector entries.