Weapons of Math Induction for the War on Parallel Programming Error
Robert A. van de Geijn
Department of Computer Science
Institute for Computational Engineering and Sciences
The University of Texas at Austin
ICES – Sept, 2010
http://www.cs.utexas.edu/users/flame/ 1
Outline
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
    Multithreaded Architectures
    Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
Introduction
The Team
UT-Austin
Faculty/Staff: Ernie Chan, Victor Eijkhout, Maggie Myers, Andy Terrel, Robert van de Geijn, Field Van Zee
Graduate Students: Bryan Marker, Kyungjoo Kim, Isaac Lee, Ardavan Pedram, Jack Poulson, Martin Schatz
Undergrads: Burns Healy, Eileen Martin, Jon Monette, Tyler Rhodes, Richard Veras, Nick Wiz
Univ. Jaume I, Spain
Faculty: Gregorio Quintana-Ortí, Enrique Quintana-Ortí, Mercedes Marques
Graduate Students: Manuel Fogue, Francisco D. Igual
RWTH Aachen
Faculty: Paolo Bientinesi
Sponsors
UT-Austin
Numerous NSF Grants, Microsoft, Intel
Univ. Jaume I
Ministerio de Ciencia e Innovación, Clearspeed, Microsoft, Nvidia
Who is this famous (former) Texan?
“I mean, if 10 years from now, when you are doing something quick and dirty, you suddenly visualize that I am looking over your shoulders and say to yourself ‘Dijkstra would not have liked this’, well, that would be enough immortality for me.”
– Dijkstra
“Literature professors read each other’s books. Why don’t computer science professors read each other’s programs?”
– Tim Mattson
Why dense linear algebra libraries
Widely used in scientific computing
Well-defined domain
Thought to be well-understood
Interesting case study
LAPACK Cholesky factorization (dpotrf)
      DO 20 J = 1, N, NB
*
*        Update and factorize the current diagonal block and test
*        for non-positive-definiteness.
*
         JB = MIN( NB, N-J+1 )
         CALL DSYRK( 'Lower', 'No transpose', JB, J-1, -ONE,
     $               A( J, 1 ), LDA, ONE, A( J, J ), LDA )
         CALL DPOTF2( 'Lower', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 )
     $      GO TO 30
         IF( J+JB.LE.N ) THEN
*
*           Compute the current block column.
*
            CALL DGEMM( 'No transpose', 'Transpose', N-J-JB+1, JB,
     $                  J-1, -ONE, A( J+JB, 1 ), LDA, A( J, 1 ),
     $                  LDA, ONE, A( J+JB, J ), LDA )
            CALL DTRSM( 'Right', 'Lower', 'Transpose', 'Non-unit',
     $                  N-J-JB+1, JB, ONE, A( J, J ), LDA,
     $                  A( J+JB, J ), LDA )
         END IF
   20 CONTINUE
      <deleted code>
      GO TO 40
*
   30 CONTINUE
      INFO = INFO + J - 1
*
   40 CONTINUE
      RETURN
LAPACK
Fortran-77 codes
One routine (algorithm) per operation in the library
Storage in column major order
Parallelism extracted from calls to multithreaded BLAS
Extracting parallelism increases synchronization and thus limits performance
Column major order hurts data locality
LAPACK does not use modern coding techniques
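To make the column-major point concrete, here is a minimal sketch of how an element's position in memory follows from its indices. The helper name `column_major_offset` and the 0-based indexing are assumptions made for illustration; Fortran itself is 1-based, and `lda` plays the role of the `LDA` argument in the LAPACK code.

```python
def column_major_offset(i, j, lda):
    """Offset of element (i, j), 0-based, in a column-major buffer
    whose leading dimension (distance between columns) is lda."""
    return i + j * lda


# The 3 x 2 matrix [[1, 4], [2, 5], [3, 6]] stored column by column.
buf = [1, 2, 3, 4, 5, 6]
lda = 3

# Walking along a row jumps lda entries at a time ...
row1 = [buf[column_major_offset(1, j, lda)] for j in range(2)]
# ... while walking down a column touches consecutive entries.
col1 = [buf[column_major_offset(i, 1, lda)] for i in range(3)]

print(row1)  # [2, 5]
print(col1)  # [4, 5, 6]
```

Entries of a column are contiguous, but entries of a row are `lda` apart, so the elements of a submatrix block are scattered through memory.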
The sky is falling
Evolution vs intelligent design
Parallelism is thrust upon the masses
Popular libraries like LAPACK must be completely rewritten (end of an evolutionary path)
Great, let’s start over
Cheaper than trying to evolve?
FLAME
Notation for expressing algorithms
Systematic derivation procedure
Families of algorithms for each operation
APIs to transform algorithms into codes
Storage and algorithm are independent
Storage-by-blocks
Parallelism with data dependencies
High performance even on “exotic” architectures like multi-GPUs
A new distributed memory library for massively parallel clusters and clusters-on-a-chip
General lesson:
This crisis is an opportunity to completely rethink your code.
To keep you interested ...
[Figure: Performance of the Cholesky factorization on GPU/CPU. Horizontal axis: matrix size (0 to 20000). Vertical axis: GFLOPS (0 to 700). Curves: MKL 10.0 spotrf on two Intel Xeon QuadCore (2.2 GHz); algorithm-by-blocks on Tesla S870; algorithm-by-blocks on Tesla S1070.]
Notation
A Motivating Example: The Cholesky Factorization
Given an $n \times n$ symmetric positive definite matrix $A$, compute
$$A = L L^T,$$
where $L$ is an $n \times n$ lower triangular matrix
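As a small worked instance (the matrix below is chosen here for illustration and does not come from the slides), a $2 \times 2$ symmetric positive definite matrix and its Cholesky factor:

$$
\begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}
=
\begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}
\begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix},
\qquad
L = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}.
$$

Checking entry by entry: $2 \cdot 2 = 4$, $2 \cdot 1 = 2$, and $1 \cdot 1 + 2 \cdot 2 = 5$.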
The Cholesky Factorization: On the Whiteboard
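[Whiteboard figure: one iteration of the factorization.]
At the start of the iteration, three parts of the matrix are marked "done" and the remaining part holds $A$ (partially updated).
Within the partially updated part, expose
$$\begin{pmatrix} \alpha_{11} & \star \\ a_{21} & A_{22} \end{pmatrix}$$
and update
$$\alpha_{11} := \sqrt{\alpha_{11}}, \qquad a_{21} := a_{21} / \alpha_{11}, \qquad A_{22} := A_{22} - a_{21} a_{21}^T .$$
At the end of the iteration, the "done" parts have grown and the remaining part again holds $A$ (partially updated).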
FLAME Notation
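[Figure: the matrix with three parts marked "done" and the remaining part holding $A$ (partially updated), in which $\alpha_{11}$, $a_{12}^T$, $a_{21}$, and $A_{22}$ are exposed.]
Repartition
$$
\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}
\rightarrow
\begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
$$
where $\alpha_{11}$ is a scalar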
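Algorithm: [A] := Chol_unb(A)
Partition $A \rightarrow \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}$ where $A_{TL}$ is $0 \times 0$
while $n(A_{BR}) \neq 0$ do
    Repartition
    $$
    \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}
    \rightarrow
    \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
    $$
    where $\alpha_{11}$ is a scalar
    $\alpha_{11} := \sqrt{\alpha_{11}}$
    $a_{21} := a_{21} / \alpha_{11}$
    $A_{22} := A_{22} - a_{21} a_{21}^T$ (syr)
    Continue with
    $$
    \begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}
    \leftarrow
    \begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
    $$
endwhile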
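The loop above can be sketched in plain Python. This is only an illustration under stated assumptions, not the FLAME API: the function name `chol_unb`, the list-of-lists storage, and the index `k` (the current size of $A_{TL}$) are all choices made here. Only the lower triangle is read and overwritten, and the strictly upper triangle is left untouched.

```python
import math


def chol_unb(A):
    """Overwrite the lower triangle of the square matrix A (list of lists)
    with its Cholesky factor, following the three updates of the loop body."""
    n = len(A)
    for k in range(n):  # k = size of A_TL; A_BR is (n - k) x (n - k)
        if A[k][k] <= 0.0:
            raise ValueError("matrix is not positive definite")
        # alpha11 := sqrt(alpha11)
        A[k][k] = math.sqrt(A[k][k])
        # a21 := a21 / alpha11
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        # A22 := A22 - a21 * a21^T (symmetric rank-1 update, lower triangle only)
        for j in range(k + 1, n):
            for i in range(j, n):
                A[i][j] -= A[i][k] * A[j][k]
    return A


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
chol_unb(A)
# Extract L from the lower triangle of the overwritten matrix.
L = [[A[i][j] if j <= i else 0.0 for j in range(3)] for i in range(3)]
print(L)  # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
```

Moving the boundary forward ("Continue with") corresponds to incrementing `k`. The explicit repartitioning of the notation is replaced here by index arithmetic, which is exactly the kind of detail the notation is meant to hide.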
General lesson:
Algorithms should be represented in a way that captures how we reason about them.
Deriving Algorithms to be Correct
Family Values
“The only effective way to raise the confidence level of a program significantly is to give a convincing proof of its correctness. But one should not first make the program and then prove its correctness, because then the requirement of providing the proof would only increase the poor programmer’s burden. On the contrary: the programmer should let correctness proof and program grow hand in hand.”
– Dijkstra
The Worksheet: A Weapon of Math Induction
Step 1: Precondition and postcondition
Precondition: $A = \hat{A}$
Note: $\hat{A}$ indicates the contents of $A$ upon entry. We use this dummy variable to be able to reason about the contents of matrix $A$ as it is being overwritten by its Cholesky factor.
Postcondition: $A = L \wedge \hat{A} = L L^T$
Note: Indicates that upon completion $A$ must contain the Cholesky factor of the original matrix.
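The postcondition can also be read as an executable check. The sketch below is an illustration only: the function name `satisfies_postcondition`, the tolerance `tol`, and the full (non-packed) storage of $L$ with explicit zeros above the diagonal are assumptions made here, not part of the worksheet.

```python
def satisfies_postcondition(A_hat, L, tol=1e-12):
    """Return True if L is lower triangular and A_hat equals L * L^T
    entry by entry, up to the absolute tolerance tol."""
    n = len(A_hat)
    for i in range(n):
        for j in range(n):
            # L must have zeros strictly above the diagonal.
            if j > i and L[i][j] != 0.0:
                return False
            # Entry (i, j) of L * L^T is the dot product of rows i and j of L.
            ll_t = sum(L[i][p] * L[j][p] for p in range(n))
            if abs(ll_t - A_hat[i][j]) > tol:
                return False
    return True


A_hat = [[4.0, 2.0], [2.0, 5.0]]  # contents of A upon entry
L = [[2.0, 0.0], [1.0, 2.0]]      # claimed contents of A upon completion
print(satisfies_postcondition(A_hat, L))  # True
```

A test like this can only confirm the postcondition for one input. The worksheet aims for more: an argument that it holds for every input satisfying the precondition.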
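| Step | Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$ |
|------|------|
| 1a | $\{ A = \hat{A} \}$ |
| 4 | |
| 2 | |
| 3 | while $m(A_{TL}) < m(A)$ do |
| 2,3 | |
| 5a | |
| 6 | |
| 8 | |
| 5b | |
| 7 | |
| 2 | |
| | endwhile |
| 2,3 | |
| 1b | $\{ A = L \wedge \hat{A} = L L^T \}$ |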
http://www.cs.utexas.edu/users/flame/ 29
![Page 55: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/55.jpg)
Deriving Algorithms to be Correct
Step 2: Finding Loop-Invariants
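Partition the operands

$$
A \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
\quad\text{and}\quad
L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
$$

Plug into postcondition $A = L \wedge \hat{A} = L L^T$:

$$
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} & \star \\ \hat{A}_{BL} & \hat{A}_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}^T
$$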
http://www.cs.utexas.edu/users/flame/ 30
![Page 57: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/57.jpg)
Deriving Algorithms to be Correct
Determine the loop-invariants
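$$
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} & \star \\ \hat{A}_{BL} & \hat{A}_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T & \star \\ L_{BL} L_{TL}^T & L_{BL} L_{BL}^T + L_{BR} L_{BR}^T \end{pmatrix}
$$

Loop-invariant 1:

$$
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & 0 \\ \hat{A}_{BL} & \hat{A}_{BR} \end{pmatrix}
\;\wedge\;
\hat{A}_{TL} = L_{TL} L_{TL}^T
$$

Loop-invariant 2:

$$
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & \hat{A}_{BR} \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

Loop-invariant 3:

$$
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$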
http://www.cs.utexas.edu/users/flame/ 31
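Loop-invariant 3 can be turned into a runtime check on the state of the matrix at the top of each iteration. The sketch below is an illustration under stated assumptions, not FLAME code: the function name `chol_invariant3_holds` is hypothetical, matrices are n x n in row-major storage, `k` plays the role of $m(A_{TL})$, and `Ahat` holds the contents of $A$ upon entry.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical checker for loop-invariant 3 with k = m(A_TL).
   Ahat: original matrix, A: current contents; both n x n, row-major.
   Only lower triangles are read. Returns 1 if the invariant holds. */
int chol_invariant3_holds(size_t n, size_t k, const double *Ahat,
                          const double *A, double tol)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j <= i; ++j) {
            double sum = 0.0;
            if (j < k) {
                /* Columns left of the boundary already hold L:
                   Ahat(i,j) must equal (L L^T)(i,j). */
                for (size_t p = 0; p <= j; ++p)
                    sum += A[i * n + p] * A[j * n + p];
                if (fabs(sum - Ahat[i * n + j]) > tol)
                    return 0;
            } else {
                /* A_BR must hold Ahat_BR - L_BL * L_BL^T. */
                for (size_t p = 0; p < k; ++p)
                    sum += A[i * n + p] * A[j * n + p];
                if (fabs(A[i * n + j] - (Ahat[i * n + j] - sum)) > tol)
                    return 0;
            }
        }
    }
    return 1;
}
```

With `k = 0` the check reduces to `A == Ahat` (the precondition), and with `k = n` it reduces to the postcondition on the lower triangle.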
![Page 61: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/61.jpg)
Deriving Algorithms to be Correct
Step 2: Enter loop-invariant in worksheet
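Here $P_{\mathrm{inv}}$ abbreviates loop-invariant 3:

$$
P_{\mathrm{inv}}:\quad
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

| Step | Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$ |
|------|------|
| 1a | $\{ A = \hat{A} \}$ |
| 4 | |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| 3 | while do |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge \cdots \}$ |
| 5a | |
| 6 | |
| 8 | |
| 5b | |
| 7 | |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| | endwhile |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge \cdots \}$ |
| 1b | $\{ A = L \wedge \hat{A} = L L^T \}$ |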
http://www.cs.utexas.edu/users/flame/ 32
![Page 62: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/62.jpg)
Deriving Algorithms to be Correct
Why a Weapon of Math Induction?
http://www.cs.utexas.edu/users/flame/ 33
![Page 63: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/63.jpg)
Deriving Algorithms to be Correct
Step 3: Finding the Loop-Guard
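Here $P_{\mathrm{inv}}$ abbreviates loop-invariant 3:

$$
P_{\mathrm{inv}}:\quad
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

| Step | Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$ |
|------|------|
| 1a | $\{ A = \hat{A} \}$ |
| 4 | |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| 3 | while $m(A_{TL}) < m(A)$ do |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge m(A_{TL}) < m(A) \}$ |
| 5a | |
| 6 | |
| 8 | |
| 5b | |
| 7 | |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| | endwhile |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge \neg ( m(A_{TL}) < m(A) ) \}$ |
| 1b | $\{ A = L \wedge \hat{A} = L L^T \}$ |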
http://www.cs.utexas.edu/users/flame/ 34
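A brief check of why this guard is the right one, written in the worksheet's notation with $\hat{A}$ denoting the contents of $A$ upon entry. When the loop exits, loop-invariant 3 holds and the guard is false. Because $A_{TL}$ is a leading block of $A$, a false guard forces $A_{TL}$ to be all of $A$, so the remaining blocks are empty and the invariant collapses to the postcondition:

$$
\neg\big( m(A_{TL}) < m(A) \big)
\;\Rightarrow\; A_{TL} = A,\ L_{TL} = L
\;\Rightarrow\; \big( A = L \;\wedge\; \hat{A} = L L^T \big).
$$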
![Page 64: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/64.jpg)
Deriving Algorithms to be Correct
Step 4: Finding the Initialization
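Here $P_{\mathrm{inv}}$ abbreviates loop-invariant 3:

$$
P_{\mathrm{inv}}:\quad
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

| Step | Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$ |
|------|------|
| 1a | $\{ A = \hat{A} \}$ |
| 4 | Partition $A \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}$, $L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}$ where $A_{TL}$ and $L_{TL}$ are $0 \times 0$ |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| 3 | while $m(A_{TL}) < m(A)$ do |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge ( m(A_{TL}) < m(A) ) \}$ |
| 5a | |
| 6 | |
| 8 | |
| 5b | |
| 7 | |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| | endwhile |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge \neg ( m(A_{TL}) < m(A) ) \}$ |
| 1b | $\{ A = L \wedge \hat{A} = L L^T \}$ |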
http://www.cs.utexas.edu/users/flame/ 35
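A short check that this initialization establishes loop-invariant 3, with $\hat{A}$ denoting the contents of $A$ upon entry. When $A_{TL}$ and $L_{TL}$ are $0 \times 0$, the block $A_{BR}$ is all of $A$ and $L_{BL}$ has no columns, so the product $L_{BL} L_{BL}^T$ is the zero matrix and the conditions on $\hat{A}_{TL}$ and $\hat{A}_{BL}$ are vacuous. The invariant therefore reduces to the precondition:

$$
A_{BR} = \hat{A}_{BR} - L_{BL} L_{BL}^T
\quad\text{with}\quad m(A_{TL}) = 0
\;\Longrightarrow\;
A = \hat{A} - 0 = \hat{A}.
$$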
![Page 65: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/65.jpg)
Deriving Algorithms to be Correct
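Step 5: Marching through the Matrix

Here $P_{\mathrm{inv}}$ abbreviates loop-invariant 3:

$$
P_{\mathrm{inv}}:\quad
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

| Step | Annotated Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$ |
|------|------|
| 1a | $\{ A = \hat{A} \}$ |
| 4 | Partition $A \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}$, $L \rightarrow \begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix}$ where $A_{TL}$ and $L_{TL}$ are $0 \times 0$ |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| 3 | while $m(A_{TL}) < m(A)$ do |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge ( m(A_{TL}) < m(A) ) \}$ |
| 5a | Repartition $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ where $\alpha_{11}$ and $\lambda_{11}$ are scalars |
| 6 | |
| 8 | |
| 5b | Continue with $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ |
| 7 | |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| | endwhile |
| 2,3 | $\{ P_{\mathrm{inv}} \wedge \neg ( m(A_{TL}) < m(A) ) \}$ |
| 1b | $\{ A = L \wedge \hat{A} = L L^T \}$ |

http://www.cs.utexas.edu/users/flame/ 36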
![Page 66: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/66.jpg)
Deriving Algorithms to be Correct
Step 6: State Before the Update
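Here $P_{\mathrm{inv}}$ abbreviates loop-invariant 3 and $P_{\mathrm{before}}$ the state before the update:

$$
P_{\mathrm{inv}}:\quad
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

$$
P_{\mathrm{before}}:\quad
\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \hat{\alpha}_{11} - l_{10}^T l_{10} & \star \\ L_{20} & \hat{a}_{21} - L_{20} l_{10} & \hat{A}_{22} - L_{20} L_{20}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{00} \\ \hat{a}_{10}^T \\ \hat{A}_{20} \end{pmatrix}
=
\begin{pmatrix} L_{00} L_{00}^T \\ l_{10}^T L_{00}^T \\ L_{20} L_{00}^T \end{pmatrix}
$$

| Step | Annotated Algorithm (excerpt) |
|------|------|
| $\vdots$ | $\vdots$ |
| 3 | while $m(A_{TL}) < m(A)$ do |
| 2,3 | $\{ P_{\mathrm{inv}} \}$ |
| 5a | Repartition $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ where $\alpha_{11}$ and $\lambda_{11}$ are scalars |
| 6 | $\{ P_{\mathrm{before}} \}$ |
| 8 | |
| 5b | Continue with $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ |
| 7 | |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| | endwhile |
| $\vdots$ | $\vdots$ |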
http://www.cs.utexas.edu/users/flame/ 37
![Page 67: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/67.jpg)
Deriving Algorithms to be Correct
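Step 7: State After the Update

Here $P_{\mathrm{inv}}$ abbreviates loop-invariant 3, $P_{\mathrm{before}}$ the state before the update, and $P_{\mathrm{after}}$ the state after the update:

$$
P_{\mathrm{inv}}:\quad
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

$$
P_{\mathrm{before}}:\quad
\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \hat{\alpha}_{11} - l_{10}^T l_{10} & \star \\ L_{20} & \hat{a}_{21} - L_{20} l_{10} & \hat{A}_{22} - L_{20} L_{20}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{00} \\ \hat{a}_{10}^T \\ \hat{A}_{20} \end{pmatrix}
=
\begin{pmatrix} L_{00} L_{00}^T \\ l_{10}^T L_{00}^T \\ L_{20} L_{00}^T \end{pmatrix}
$$

$$
P_{\mathrm{after}}:\quad
\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \lambda_{11} & \star \\ L_{20} & l_{21} & \hat{A}_{22} - L_{20} L_{20}^T - l_{21} l_{21}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{00} & \star \\ \hat{a}_{10}^T & \hat{\alpha}_{11} \\ \hat{A}_{20} & \hat{a}_{21} \end{pmatrix}
=
\begin{pmatrix} L_{00} L_{00}^T & \star \\ l_{10}^T L_{00}^T & l_{10}^T l_{10} + \lambda_{11}^2 \\ L_{20} L_{00}^T & L_{20} l_{10} + l_{21} \lambda_{11} \end{pmatrix}
$$

| Step | Annotated Algorithm (excerpt) |
|------|------|
| $\vdots$ | $\vdots$ |
| 3 | while $m(A_{TL}) < m(A)$ do |
| 2,3 | $\{ P_{\mathrm{inv}} \}$ |
| 5a | Repartition $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ where $\alpha_{11}$ and $\lambda_{11}$ are scalars |
| 6 | $\{ P_{\mathrm{before}} \}$ |
| 8 | |
| 5b | Continue with $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ |
| 7 | $\{ P_{\mathrm{after}} \}$ |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| | endwhile |
| $\vdots$ | $\vdots$ |

http://www.cs.utexas.edu/users/flame/ 38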
![Page 68: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/68.jpg)
Deriving Algorithms to be Correct
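Step 8: The Update

Here $P_{\mathrm{inv}}$ abbreviates loop-invariant 3, $P_{\mathrm{before}}$ the state before the update, and $P_{\mathrm{after}}$ the state after the update:

$$
P_{\mathrm{inv}}:\quad
\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}
=
\begin{pmatrix} L_{TL} & \star \\ L_{BL} & \hat{A}_{BR} - L_{BL} L_{BL}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{TL} \\ \hat{A}_{BL} \end{pmatrix}
=
\begin{pmatrix} L_{TL} L_{TL}^T \\ L_{BL} L_{TL}^T \end{pmatrix}
$$

$$
P_{\mathrm{before}}:\quad
\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \hat{\alpha}_{11} - l_{10}^T l_{10} & \star \\ L_{20} & \hat{a}_{21} - L_{20} l_{10} & \hat{A}_{22} - L_{20} L_{20}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{00} \\ \hat{a}_{10}^T \\ \hat{A}_{20} \end{pmatrix}
=
\begin{pmatrix} L_{00} L_{00}^T \\ l_{10}^T L_{00}^T \\ L_{20} L_{00}^T \end{pmatrix}
$$

$$
P_{\mathrm{after}}:\quad
\begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{00} & \star & \star \\ l_{10}^T & \lambda_{11} & \star \\ L_{20} & l_{21} & \hat{A}_{22} - L_{20} L_{20}^T - l_{21} l_{21}^T \end{pmatrix}
\;\wedge\;
\begin{pmatrix} \hat{A}_{00} & \star \\ \hat{a}_{10}^T & \hat{\alpha}_{11} \\ \hat{A}_{20} & \hat{a}_{21} \end{pmatrix}
=
\begin{pmatrix} L_{00} L_{00}^T & \star \\ l_{10}^T L_{00}^T & l_{10}^T l_{10} + \lambda_{11}^2 \\ L_{20} L_{00}^T & L_{20} l_{10} + l_{21} \lambda_{11} \end{pmatrix}
$$

| Step | Annotated Algorithm (excerpt) |
|------|------|
| $\vdots$ | $\vdots$ |
| 3 | while $m(A_{TL}) < m(A)$ do |
| 2,3 | $\{ P_{\mathrm{inv}} \}$ |
| 5a | Repartition $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ where $\alpha_{11}$ and $\lambda_{11}$ are scalars |
| 6 | $\{ P_{\mathrm{before}} \}$ |
| 8 | $\alpha_{11} := \sqrt{\alpha_{11}}$; $a_{21} := a_{21} / \alpha_{11}$; $A_{22} := A_{22} - a_{21} a_{21}^T$ |
| 5b | Continue with $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$, $\begin{pmatrix} L_{TL} & 0 \\ L_{BL} & L_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} L_{00} & 0 & 0 \\ l_{10}^T & \lambda_{11} & 0 \\ L_{20} & l_{21} & L_{22} \end{pmatrix}$ |
| 7 | $\{ P_{\mathrm{after}} \}$ |
| 2 | $\{ P_{\mathrm{inv}} \}$ |
| | endwhile |
| $\vdots$ | $\vdots$ |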
http://www.cs.utexas.edu/users/flame/ 39
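The update in step 8 can be read off by comparing the state before the update (step 6) with the state after it (step 7). The following worked derivation is a sketch in the worksheet's notation, with hats marking the original contents of $A$ and the superscript "before" (introduced here only for illustration) marking values at step 6. Before the update, $\alpha_{11}^{\text{before}} = \hat{\alpha}_{11} - l_{10}^T l_{10}$, $a_{21}^{\text{before}} = \hat{a}_{21} - L_{20} l_{10}$ and $A_{22}^{\text{before}} = \hat{A}_{22} - L_{20} L_{20}^T$. After the update, $\hat{\alpha}_{11} = l_{10}^T l_{10} + \lambda_{11}^2$ and $\hat{a}_{21} = L_{20} l_{10} + l_{21} \lambda_{11}$ must hold, which gives:

$$
\begin{aligned}
\lambda_{11} &= \sqrt{\hat{\alpha}_{11} - l_{10}^T l_{10}} = \sqrt{\alpha_{11}^{\text{before}}}, \\
l_{21} &= \big( \hat{a}_{21} - L_{20} l_{10} \big) / \lambda_{11} = a_{21}^{\text{before}} / \lambda_{11}, \\
\hat{A}_{22} - L_{20} L_{20}^T - l_{21} l_{21}^T &= A_{22}^{\text{before}} - l_{21} l_{21}^T .
\end{aligned}
$$

Overwriting $\alpha_{11}$ with $\lambda_{11}$ and $a_{21}$ with $l_{21}$ yields exactly the three assignments $\alpha_{11} := \sqrt{\alpha_{11}}$, $a_{21} := a_{21}/\alpha_{11}$, $A_{22} := A_{22} - a_{21} a_{21}^T$.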
![Page 69: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/69.jpg)
Deriving Algorithms to be Correct
The Algorithm
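| Algorithm: $A := \mathrm{Chol\_unb\_var3}(A)$ |
|------|
| Partition $A \rightarrow \begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix}$ where $A_{TL}$ is $0 \times 0$ |
| while $m(A_{TL}) < m(A)$ do |
| Repartition $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \rightarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$ where $\alpha_{11}$ is a scalar |
| $\alpha_{11} := \sqrt{\alpha_{11}}$ |
| $a_{21} := a_{21} / \alpha_{11}$ |
| $A_{22} := A_{22} - a_{21} a_{21}^T$ |
| Continue with $\begin{pmatrix} A_{TL} & \star \\ A_{BL} & A_{BR} \end{pmatrix} \leftarrow \begin{pmatrix} A_{00} & \star & \star \\ a_{10}^T & \alpha_{11} & \star \\ A_{20} & a_{21} & A_{22} \end{pmatrix}$ |
| endwhile |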
http://www.cs.utexas.edu/users/flame/ 40
![Page 70: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/70.jpg)
Deriving Algorithms to be Correct
Having families of correct algorithms is good
Don’t necessarily start with the legacy implementation of the “usual” algorithm. It may not parallelize well.
Find all (most) algorithms and pick the best for the target architecture.
In our case, we can systematically generate all (loop-based) algorithms.
http://www.cs.utexas.edu/users/flame/ 41
![Page 74: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/74.jpg)
Deriving Algorithms to be Correct
Is the methodology just a theoretical curiosity?
Broad applicability to all operations supported by LAPACK
Not yet: eigensolvers, SVD.
The methodology is sufficiently systematic that it has been automated (with Mathematica).
Paolo Bientinesi. “Mechanical Derivation and Systematic Analysis of Correct Linear Algebra Algorithms.” Dissertation, UT-Austin, 2006.
Recently generalized to the derivation of Krylov subspace methods.
Victor Eijkhout, Paolo Bientinesi, and Robert van de Geijn. “Toward Mechanical Derivation of Krylov Solver Libraries.” ICCS, 2010.
Extended to systematic derivation of numerical stability analysis.
Paolo Bientinesi and Robert A. van de Geijn. “A Goal-Oriented and Modular Approach to Stability Analysis.” SIMAX. Conditionally accepted.
http://www.cs.utexas.edu/users/flame/ 42
![Page 79: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/79.jpg)
Deriving Algorithms to be Correct
How does this apply to parallel programming?
Coding correct parallel code is difficult.
We derive our (for now sequential) algorithms to be correct.
It is important to choose from a family of algorithms.
Choose an algorithm that parallelizes well.
http://www.cs.utexas.edu/users/flame/ 43
![Page 82: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/82.jpg)
From Correct Algorithm to Correct Code
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
  Multithreaded Architectures
  Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 44
![Page 83: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/83.jpg)
From Correct Algorithm to Correct Code
FLAME/C Code
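Repartition

$$
\begin{pmatrix} A_{TL} & A_{TR} \\ A_{BL} & A_{BR} \end{pmatrix}
\rightarrow
\begin{pmatrix} A_{00} & a_{01} & A_{02} \\ a_{10}^T & \alpha_{11} & a_{12}^T \\ A_{20} & a_{21} & A_{22} \end{pmatrix}
$$

where $\alpha_{11}$ is a scalar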
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************** */ /* *************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
http://www.cs.utexas.edu/users/flame/ 45
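Conceptually, repartitioning involves no data movement: each block is just a view (pointer, dimensions, stride) into the original matrix. The sketch below illustrates that idea in plain C. It is not the libflame implementation, and every name in it (`View`, `subview`, `repart_expose`) is hypothetical; it only exposes the three blocks that the Cholesky update uses, for a boundary at `k` rows and columns.

```c
#include <stddef.h>

/* Hypothetical view type: a window into row-major storage. */
typedef struct {
    double *data; /* pointer to the (0,0) entry of the view */
    size_t  m, n; /* rows and columns of the view */
    size_t  ld;   /* row stride of the underlying storage */
} View;

/* Sub-view of V starting at (i, j) with size m x n.
   Empty views keep the parent's pointer so that no out-of-range
   address is ever formed. */
static View subview(View V, size_t i, size_t j, size_t m, size_t n)
{
    View S = { V.data, m, n, V.ld };
    if (m > 0 && n > 0)
        S.data = V.data + i * V.ld + j;
    return S;
}

/* With A_TL of size k x k (k < A.m, A square), expose alpha11, a21 and
   A22 as views into A. Nothing is copied. */
void repart_expose(View A, size_t k, View *alpha11, View *a21, View *A22)
{
    size_t rest = A.m - k - 1;
    *alpha11 = subview(A, k,     k,     1,    1);
    *a21     = subview(A, k + 1, k,     rest, 1);
    *A22     = subview(A, k + 1, k + 1, rest, rest);
}
```

Writing through `alpha11->data[0]` modifies the original matrix, which is why an update expressed on the exposed blocks updates $A$ in place.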
![Page 85: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/85.jpg)
From Correct Algorithm to Correct Code
(Unblocked) FLAME/C Code
int FLA_Cholesky_unb( FLA_Obj A )
{
/* ... FLA_Part_2x2( ); ... */
while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
FLA_Repart_2x2_to_3x3(
ATL, /**/ ATR, &A00, /**/ &a01, &A02,
/* ************* */ /* ************************** */
&a10t, /**/ &alpha11, &a12t,
ABL, /**/ ABR, &A20, /**/ &a21, &A22,
1, 1, FLA_BR );
/*------------------------------------------------------------*/
FLA_Sqrt ( alpha11 ); /* alpha11 := sqrt( alpha11 ) */
FLA_Inv_Scal( alpha11, a21 ); /* a21 := a21 / alpha11 */
FLA_Syr ( FLA_LOWER_TRIANGULAR,
FLA_MINUS_ONE,
a21, A22 ); /* A22 := A22 - a21 * a21t */
/*------------------------------------------------------------*/
/* FLA_Cont_with_3x3_to_2x2( ); ... */
}
}
http://www.cs.utexas.edu/users/flame/ 46
![Page 86: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/86.jpg)
Achieving High Performance
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
  Multithreaded Architectures
  Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
http://www.cs.utexas.edu/users/flame/ 47
![Page 87: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/87.jpg)
Achieving High Performance
Who is this famous (former) Texan?
http://www.cs.utexas.edu/users/flame/ 48
![Page 88: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/88.jpg)
Achieving High Performance
Who is this famous Texan? Kazushige Goto (TACC)
http://www.cs.utexas.edu/users/flame/ 49
![Page 89: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/89.jpg)
Achieving High Performance
High-Performance Matrix-Matrix Multiplication
Why is matrix-matrix multiplication (gemm) so important?
$O(n^3)$ computation on $O(n^2)$ data.
Allows data movement between RAM and cache to be hidden.
Can achieve extremely high performance (up to 99% of peak on some architectures).
Required reading (shameless self-promotion):
Kazushige Goto and Robert A. van de Geijn. “Anatomy of High-Performance Matrix Multiplication,” ACM Transactions on Mathematical Software, 34(3): Article 12, 25 pages, May 2008.
Use the method to derive blocked algorithms that cast more computation in terms of gemm.
http://www.cs.utexas.edu/users/flame/ 50
![Page 93: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/93.jpg)
Achieving High Performance
(Unblocked) FLAME/C Code (Again)
int FLA_Cholesky_unb( FLA_Obj A )
{
  /* ... FLA_Part_2x2( ); ... */
  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
    FLA_Repart_2x2_to_3x3(
        ATL, /**/ ATR,       &A00,  /**/ &a01,     &A02,
     /* ************* */   /* ************************** */
                             &a10t, /**/ &alpha11, &a12t,
        ABL, /**/ ABR,       &A20,  /**/ &a21,     &A22,
        1, 1, FLA_BR );
    /*------------------------------------------------------------*/
    FLA_Sqrt    ( alpha11 );       /* alpha11 := sqrt( alpha11 ) */
    FLA_Inv_Scal( alpha11, a21 );  /* a21 := a21 / alpha11       */
    FLA_Syr     ( FLA_LOWER_TRIANGULAR,
                  FLA_MINUS_ONE,
                  a21, A22 );      /* A22 := A22 - a21 * a21t    */
    /*------------------------------------------------------------*/
    /* FLA_Cont_with_3x3_to_2x2( ); ... */
  }
}
Blocked FLAME/C Code
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
  /* ... FLA_Part_2x2( ); ... */
  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );
    FLA_Repart_2x2_to_3x3(
        ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
     /* ************* */   /* ******************** */
                             &A10, /**/ &A11, &A12,
        ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
        b, b, FLA_BR );
    /*------------------------------------------------------------*/
    FLA_Cholesky_unb( A11 );              /* A11 := Cholesky( A11 )   */
    FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
              FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11,
                       A21 );             /* A21 := A21 * inv( A11 )' */
    FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
              FLA_MINUS_ONE, A21,
              FLA_ONE,       A22 );       /* A22 := A22 - A21 * A21'  */
    /*------------------------------------------------------------*/
    /* FLA_Cont_with_3x3_to_2x2( ); ... */
  }
}
Fighting the War on Parallel Programming Error
“When we had no computers, we had no programming problem either. When we had a few computers, we had a mild programming problem. Confronted with machines a million times as powerful, we are faced with a gigantic programming problem.”
– Dijkstra
Fighting the War on Parallel Programming Error Multithreaded Architectures
LAPACK parallelization: multithreaded BLAS
$\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}$, where $A_{11}$ is $b \times b$
Pro?:
Evolve legacy code
Con:
Continue to code in the LINPACK style (1970s)
Each call to BLAS (compute kernels) is a synchronization point for threads
As the number of threads increases, serial operations with cost $O(nb^2)$ or $O(b^3)$ are no longer negligible compared with $O(n^2 b)$
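The last point can be made concrete with a rough, Amdahl-style estimate. The sketch below assumes the standard leading-order flop counts for one iteration of blocked Cholesky (about $b^3/3$ for the factorization of $A_{11}$, $nb^2$ for the triangular solve, and $n^2 b$ for the symmetric update, with $n$ the number of rows in $A_{21}$), and it assumes, as a simplification, that only the $O(n^2 b)$ update speeds up with more threads. The names `IterCost`, `iteration_cost`, and `speedup_bound` are hypothetical.

```c
/* Leading-order flop counts for one iteration of blocked Cholesky. */
typedef struct {
    double chol;   /* A11 := Cholesky(A11),     about b^3 / 3 */
    double trsm;   /* A21 := A21 * inv(A11)',   about n * b^2 */
    double syrk;   /* A22 := A22 - A21 * A21',  about n^2 * b */
} IterCost;

IterCost iteration_cost(double n, double b)
{
    IterCost c;
    c.chol = b * b * b / 3.0;
    c.trsm = n * b * b;
    c.syrk = n * n * b;
    return c;
}

/* Upper bound on the speedup of one iteration with p threads, under the
 * simplifying assumption that chol and trsm stay serial while syrk
 * parallelizes perfectly. */
double speedup_bound(double n, double b, double p)
{
    IterCost c = iteration_cost(n, b);
    double serial = c.chol + c.trsm;
    return (serial + c.syrk) / (serial + c.syrk / p);
}
```

With $n = 1000$ and $b = 100$, this model gives a bound of roughly 7.6 on 24 threads and never exceeds about 10.7 however many threads are added, which is the sense in which the lower-order terms stop being negligible.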
Of algorithms-by-blocks and runtime systems
Improve parallelism and data locality: algorithms-by-blocks
Matrix of matrix blocks
Matrix blocks as unit of data
Computation with matrix blocks as unit of computation
Execute sequential code to generate DAG of tasks
SuperMatrix
Runtime system for scheduling tasks to threads
Sequential kernels to be executed by the threads
Always be sure to make the machine-specific part someone else’s problem
SuperMatrix is part of Ernie Chan’s dissertation work.
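The "matrix of matrix blocks" idea can be sketched as a small data structure in which each block is stored contiguously and addressed through a table of pointers. This is only an illustration of storage by blocks: the type `BlockMatrix` and the functions `bm_create`, `bm_free`, `bm_block`, and `bm_at` are hypothetical names, not part of libflame, and square blocks of equal size are assumed.

```c
#include <stdlib.h>

typedef struct {
    size_t   nblk;  /* number of blocks per dimension          */
    size_t   bs;    /* each block is bs x bs, stored row-major */
    double **blk;   /* blk[I * nblk + J] points at block (I,J) */
} BlockMatrix;

void bm_free(BlockMatrix *m)
{
    if (!m) return;
    if (m->blk) {
        for (size_t k = 0; k < m->nblk * m->nblk; k++)
            free(m->blk[k]);
        free(m->blk);
    }
    free(m);
}

/* Allocate an (nblk*bs) x (nblk*bs) matrix, zero-initialized, by blocks. */
BlockMatrix *bm_create(size_t nblk, size_t bs)
{
    BlockMatrix *m = malloc(sizeof *m);
    if (!m) return NULL;
    m->nblk = nblk;
    m->bs   = bs;
    m->blk  = calloc(nblk * nblk, sizeof *m->blk);
    if (!m->blk) { free(m); return NULL; }
    for (size_t k = 0; k < nblk * nblk; k++) {
        m->blk[k] = calloc(bs * bs, sizeof **m->blk);
        if (!m->blk[k]) { bm_free(m); return NULL; }
    }
    return m;
}

/* Block (I,J): the unit of data handed to a sequential kernel. */
double *bm_block(BlockMatrix *m, size_t I, size_t J)
{
    return m->blk[I * m->nblk + J];
}

/* Address of scalar element (i,j) of the full matrix. */
double *bm_at(BlockMatrix *m, size_t i, size_t j)
{
    double *b = bm_block(m, i / m->bs, j / m->bs);
    return &b[(i % m->bs) * m->bs + (j % m->bs)];
}
```

Because every block is contiguous, a task that works on block (I,J) touches one compact region of memory, which is the data-locality benefit the slide refers to.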
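$$
A = \begin{pmatrix}
A^{(0,0)} & \star & \star & \cdots & \star \\
A^{(1,0)} & A^{(1,1)} & \star & \cdots & \star \\
A^{(2,0)} & A^{(2,1)} & A^{(2,2)} & \cdots & \star \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A^{(M-1,0)} & A^{(M-1,1)} & A^{(M-1,2)} & \cdots & A^{(M-1,N-1)}
\end{pmatrix}
$$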
Algorithm-by-blocks implementation: (almost) no change
int FLA_Cholesky_blk( FLA_Obj A, int nb_alg )
{
  /* ... FLA_Part_2x2( ); ... */
  while ( FLA_Obj_width( ATL ) < FLA_Obj_width( A ) ){
    b = min( FLA_Obj_length( ABR ), nb_alg );
    FLA_Repart_2x2_to_3x3(
        ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
     /* ************* */   /* ******************** */
                             &A10, /**/ &A11, &A12,
        ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
        1, 1, FLA_BR );
    /*------------------------------------------------------------*/
    FLA_Chol( FLA_LOWER_TRIANGULAR,
              *FLASH_OBJ_PTR_AT( A11 ) );
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR,
                FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, FLA_ONE, A22 );
    /*------------------------------------------------------------*/
    /* FLA_Cont_with_3x3_to_2x2( ); ... */
  }
}
The FLAME runtime system “pre-executes” the code.
Whenever a routine is encountered, a pending task is annotated in a global task queue
$$
\begin{pmatrix}
A^{(0,0)} & \star & \star & \cdots & \star \\
A^{(1,0)} & A^{(1,1)} & \star & \cdots & \star \\
A^{(2,0)} & A^{(2,1)} & A^{(2,2)} & \cdots & \star \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A^{(M-1,0)} & A^{(M-1,1)} & A^{(M-1,2)} & \cdots & A^{(M-1,N-1)}
\end{pmatrix}
$$
FLAME Parallelization: SuperMatrix
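$$
\begin{pmatrix}
A^{(0,0)} & \star & \star & \cdots \\
A^{(1,0)} & A^{(1,1)} & \star & \cdots \\
A^{(2,0)} & A^{(2,1)} & A^{(2,2)} & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\;\xrightarrow{\text{at runtime, build DAG}}\;
\begin{array}{l}
\texttt{FLA\_Cholesky\_unb}\left( A^{(0,0)} \right) \\
A^{(1,0)} := A^{(1,0)} \, \mathrm{tril}\left( A^{(0,0)} \right)^{-T} \\
A^{(2,0)} := A^{(2,0)} \, \mathrm{tril}\left( A^{(0,0)} \right)^{-T} \\
\quad \vdots \\
A^{(1,1)} := A^{(1,1)} - A^{(1,0)} \left( A^{(1,0)} \right)^{T} \\
\quad \vdots
\end{array}
$$
SuperMatrix
Once all tasks are entered on the DAG, the real execution begins!
Tasks with all input operands available are ready; other tasks must wait in the global queue
Upon termination of a task, the corresponding thread updates the list of pending tasks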
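The dependency analysis described here can be sketched in a few dozen lines of C. This is a simplified model under stated assumptions, not the SuperMatrix implementation: all names (`Dag`, `Task`, `build_cholesky_dag`, `critical_path`) are hypothetical, each task is assumed to depend on the last writer of every block it reads or overwrites, and off-diagonal updates of $A_{22}$ are modeled as separate gemm tasks. Tasks are recorded in sequential program order, and each task's level (the earliest step at which it could run) is computed as it is added.

```c
#define MAX_N     8     /* illustrative limit on blocks per dimension */
#define MAX_TASKS 256
#define MAX_DEPS  3

typedef enum { CHOL, TRSM, SYRK, GEMM } Kind;

typedef struct {
    Kind kind;
    int  out_i, out_j;     /* block this task overwrites            */
    int  deps[MAX_DEPS];   /* tasks that must complete before this  */
    int  ndeps;
    int  level;            /* 1 + longest chain of predecessors     */
} Task;

typedef struct {
    Task tasks[MAX_TASKS];
    int  ntasks;
    int  last_writer[MAX_N][MAX_N];  /* last task to write each block, or -1 */
} Dag;

static void add_dep(Task *t, int writer)
{
    if (writer >= 0) t->deps[t->ndeps++] = writer;
}

/* Record a task writing block (oi,oj) and reading blocks (ai,aj), (bi,bj);
 * pass -1 for an unused read operand. */
static void add_task(Dag *d, Kind kind, int oi, int oj,
                     int ai, int aj, int bi, int bj)
{
    Task *t = &d->tasks[d->ntasks];
    t->kind = kind; t->out_i = oi; t->out_j = oj;
    t->ndeps = 0;   t->level = 1;
    if (ai >= 0) add_dep(t, d->last_writer[ai][aj]);
    if (bi >= 0) add_dep(t, d->last_writer[bi][bj]);
    add_dep(t, d->last_writer[oi][oj]);
    for (int k = 0; k < t->ndeps; k++)
        if (d->tasks[t->deps[k]].level + 1 > t->level)
            t->level = d->tasks[t->deps[k]].level + 1;
    d->last_writer[oi][oj] = d->ntasks++;
}

/* "Pre-execute" Cholesky by blocks on an N x N matrix of blocks, recording
 * tasks instead of computing. Returns the task count, or -1 if N is invalid. */
int build_cholesky_dag(Dag *d, int N)
{
    if (N < 1 || N > MAX_N) return -1;
    d->ntasks = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d->last_writer[i][j] = -1;
    for (int k = 0; k < N; k++) {
        add_task(d, CHOL, k, k, -1, -1, -1, -1);
        for (int i = k + 1; i < N; i++)
            add_task(d, TRSM, i, k, k, k, -1, -1);
        for (int i = k + 1; i < N; i++)
            for (int j = k + 1; j <= i; j++) {
                if (i == j) add_task(d, SYRK, i, i, i, k, -1, -1);
                else        add_task(d, GEMM, i, j, i, k, j, k);
            }
    }
    return d->ntasks;
}

/* Length of the longest dependency chain in the DAG. */
int critical_path(const Dag *d)
{
    int longest = 0;
    for (int t = 0; t < d->ntasks; t++)
        if (d->tasks[t].level > longest) longest = d->tasks[t].level;
    return longest;
}
```

Tasks that share a level have no dependencies on one another in this model, so a scheduler is free to hand them to different threads; the critical path length is the number of steps that remain even with unlimited threads.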
Separation of concerns simplifies programming
Library code that can target many architectures.
Run-time system that can implement different schedulers for different situations.
Who is this famous Texan?
UT-Texas must be the better, faster, more successful!
Target Architecture 1
4 socket 2.66 GHz Intel Dunnington - 24 cores
16MB shared L3 cache per socket
OpenMP Intel compiler 11.1
Intel MKL 11.1 (Windows), 10.2 (Linux)
Cholesky factorization (Linux)
Cholesky factorization (Windows)
LU factorization (Linux)
QR factorization (Linux)
Target Architecture 2
4 socket 2.3 GHz AMD Opteron Quad-Core
2MB shared L3 cache per socket
OpenMP Intel compiler 10.1
GotoBLAS2 1.00
LU factorization with pivoting
Related Approaches
Cilk (MIT), TBB (Intel) and SMPSs (Barcelona Supercomputing Center)
General-purpose parallel programming
Cilk, TBB → irregular/recursive problems
SMPSs → more general, also manages dependencies
High-level language based on OpenMP-like pragmas + compiler + runtime system
Modest results for dense linear algebra
PLASMA Project
Next step in the LAPACK evolutionary path
Traditional style of implementing algorithms
Does not solve the programmability problem
Hierarchically Tiled Arrays
Abstraction for computing with matrices stored by blocks.
![Page 137: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/137.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
![Page 138: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/138.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Didn’t we solve the problem in the 1990s?
ScaLAPACK (UTK/Berkeley)
Previous step in the LAPACK evolution.
Rooted in LAPACK, which is itself rooted in LINPACK (1970s)
PLAPACK (UT-Austin)
Object-based library
Inspired the FLAME approach
For very large problems on distributed memory clusters, these should suffice.
![Page 142: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/142.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Renewed interest in distributed memory libraries
Intel’s SCC research processor
48 Pentium cores on one chip.
Connected via very fast on-chip communication buffers.
No cache-coherency protocol.
Purpose: to study the programmability problem for many-core architectures.
![Page 143: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/143.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
A New Framework for Distributed Memory Dense Matrix Libraries
Elemental (Jack Poulson + Bryan Marker)
C++ coded in the style of FLAME/C
2D elemental cyclic matrix distribution.
Does NOT tie algorithmic block size to distribution block size.
ScaLAPACK
Fortran77 coded in the style of LAPACK.
2D block cyclic matrix distribution.
Ties algorithmic block size to distribution block size.
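The difference between the two distributions can be sketched with a small owner-computation example. This is an illustration only, assuming an r x c process grid with zero alignment; `Owner`, `BlockCyclicOwner`, `ElementCyclicOwner` and `RowsPerProcessRow` are hypothetical names, not part of Elemental or ScaLAPACK. The element-wise cyclic layout is treated as the block-cyclic layout with 1 x 1 distribution blocks.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type: coordinates of the owning process in the grid.
struct Owner { int row; int col; };

// 2D block-cyclic: mb x nb distribution blocks are dealt round-robin
// over an r x c process grid (alignment assumed to be zero).
Owner BlockCyclicOwner(int i, int j, int mb, int nb, int r, int c) {
    assert(i >= 0 && j >= 0 && mb > 0 && nb > 0 && r > 0 && c > 0);
    return Owner{ (i / mb) % r, (j / nb) % c };
}

// 2D element-wise cyclic: the special case of 1 x 1 distribution blocks.
Owner ElementCyclicOwner(int i, int j, int r, int c) {
    return BlockCyclicOwner(i, j, 1, 1, r, c);
}

// Counts how many of the rows [start, start + height) each of the r process
// rows owns when rows are dealt out in distribution blocks of mb rows.
std::vector<int> RowsPerProcessRow(int start, int height, int mb, int r) {
    assert(start >= 0 && height >= 0 && mb > 0 && r > 0);
    std::vector<int> counts(r, 0);
    for (int i = start; i < start + height; ++i)
        ++counts[(i / mb) % r];
    return counts;
}
```

For example, `RowsPerProcessRow(64, 64, 64, 4)` places all 64 rows of a panel on a single process row, while `RowsPerProcessRow(64, 64, 1, 4)` spreads the same panel evenly over the four process rows. With 1 x 1 distribution blocks the algorithmic block size can be tuned freely without changing how work is spread over the grid.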
![Page 148: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/148.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental: FLAME for distributed memory architectures
template<typename T>
void
Elemental::LAPACK::Internal::CholLVar3
( DistMatrix<T,MC,MR>& A )
{
    const Grid& grid = A.GetGrid();
    // Matrix views
    DistMatrix<T,MC,MR>
        ATL(grid), ATR(grid),  A00(grid), A01(grid), A02(grid),
        ABL(grid), ABR(grid),  A10(grid), A11(grid), A12(grid),
                               A20(grid), A21(grid), A22(grid);
    // Temporary matrix distributions
    DistMatrix<T,Star,Star> A11_Star_Star(grid);
    DistMatrix<T,VC,  Star> A21_VC_Star(grid);
    DistMatrix<T,MC,  Star> A21_MC_Star(grid);
    DistMatrix<T,MR,  Star> A21_MR_Star(grid);
    // Start the algorithm
    PartitionDownDiagonal( A, ATL, ATR,
                              ABL, ABR );
    while( ABR.Height() > 0 )
    {
        RepartitionDownDiagonal( ATL, /**/ ATR,  A00, /**/ A01, A02,
                                /*************/ /******************/
                                      /**/       A10, /**/ A11, A12,
                                 ABL, /**/ ABR,  A20, /**/ A21, A22 );
        A21_MC_Star.AlignWith( A22 );
        A21_MR_Star.AlignWith( A22 );
        //--------------------------------------------------------------------//
        // Replicate A11 on every process, factor the local copy, write it back
        A11_Star_Star = A11;
        LAPACK::Chol( Lower, A11_Star_Star.LocalMatrix() );
        A11 = A11_Star_Star;
        // Redistribute A21, then A21 := A21 inv(L11)^H with a purely local Trsm
        A21_VC_Star = A21;
        BLAS::Trsm( Right, Lower, ConjugateTranspose, NonUnit,
                    (T)1, A11_Star_Star.LockedLocalMatrix(),
                          A21_VC_Star.LocalMatrix() );
        // Redistribute A21 again so that A22 := A22 - A21 A21^H updates local data
        A21_MC_Star = A21_VC_Star;
        A21_MR_Star = A21_VC_Star;
        BLAS::Internal::HerkLNUpdate( (T)-1, A21_MC_Star, A21_MR_Star, (T)1, A22 );
        A21 = A21_MC_Star;
        //--------------------------------------------------------------------//
        A21_MC_Star.FreeConstraints();
        A21_MR_Star.FreeConstraints();
        SlidePartitionDownDiagonal( ATL, /**/ ATR,  A00, A01, /**/ A02,
                                         /**/       A10, A11, /**/ A12,
                                   /*************/ /******************/
                                    ABL, /**/ ABR,  A20, A21, /**/ A22 );
    }
}
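The three updates between the dashed comment lines (factor A11, triangular solve with A21, rank-k update of A22) can be seen without any of the distribution machinery in a serial sketch. The version below is an assumption-laden illustration, not Elemental code: it works on a real symmetric positive definite matrix rather than a Hermitian one, uses plain loops instead of LAPACK/BLAS calls, and `Matrix` and `BlockedCholLower` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major n x n matrix in a flat vector; only the lower triangle is referenced.
struct Matrix {
    int n;
    std::vector<double> a;
    explicit Matrix(int n_) : n(n_), a(static_cast<std::size_t>(n_) * n_, 0.0) {}
    double& operator()(int i, int j) { return a[static_cast<std::size_t>(i) * n + j]; }
    double operator()(int i, int j) const { return a[static_cast<std::size_t>(i) * n + j]; }
};

// Overwrites the lower triangle of A with L such that A = L L^T
// (right-looking blocked algorithm; nb is the algorithmic block size).
void BlockedCholLower(Matrix& A, int nb) {
    assert(nb > 0);
    const int n = A.n;
    for (int k = 0; k < n; k += nb) {
        const int kb = std::min(nb, n - k);  // A11 is kb x kb
        const int e = k + kb;                // first row/column of A22

        // A11 := Chol(A11), unblocked
        for (int j = k; j < e; ++j) {
            double d = A(j, j);
            for (int p = k; p < j; ++p) d -= A(j, p) * A(j, p);
            assert(d > 0.0);  // fails if A is not positive definite
            A(j, j) = std::sqrt(d);
            for (int i = j + 1; i < e; ++i) {
                double s = A(i, j);
                for (int p = k; p < j; ++p) s -= A(i, p) * A(j, p);
                A(i, j) = s / A(j, j);
            }
        }

        // A21 := A21 inv(L11)^T, solved one row of A21 at a time
        for (int i = e; i < n; ++i) {
            for (int j = k; j < e; ++j) {
                double s = A(i, j);
                for (int p = k; p < j; ++p) s -= A(i, p) * A(j, p);
                A(i, j) = s / A(j, j);
            }
        }

        // A22 := A22 - A21 A21^T, lower triangle only
        for (int i = e; i < n; ++i) {
            for (int j = e; j <= i; ++j) {
                double s = 0.0;
                for (int p = k; p < e; ++p) s += A(i, p) * A(j, p);
                A(i, j) -= s;
            }
        }
    }
}
```

The result does not depend on `nb`; only the grouping of the arithmetic changes. In the distributed code the same three steps appear, with the redistributions inserted so that each step operates on locally stored data.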
![Page 150: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/150.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Target Architecture 3
Total of 15 × 4 × 4 = 240 cores:
15 nodes (out of 3936 nodes)
4 socket 2.3 GHz AMD Opteron Quad-Core
2MB shared L3 cache per socket
Full-CLOS InfiniBand 1Gb/sec
MVAPICH2 Release 1.2
GotoBLAS 1.30
![Page 151: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/151.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental GEMM, 240 cores
![Page 152: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/152.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental Cholesky, 240 cores
![Page 153: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/153.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
Elemental LU with partial pivoting, 240 cores
![Page 154: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/154.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
An Exercise in Portability: Elemental → SCC
Jan 11 - Tim Mattson: We will send the emulator
Jan 31 - Bryan Marker: I’m almost ready for you to test the stationary C Gemm (NN) implementation in [Elemental] on an actual [SCC] board.
Feb 3 - Bryan Marker: Alright gentlemen, I have two test programs for Gemm C NN (one I created and one by Jack).
Feb 4 - Bryan Marker: FYI, I have the Cholesky variant 2 ported [to the emulator]. Very easy fix to avoid SendRecv.
(Many weeks of no progress while everyone was busy with other things)
March 18 - Rob van der Wijngaart (Intel): Good news, [...] the app is running on SCC as we speak. Some of the tests inside the app are reporting failures, but these can now be debugged.
March 18 - Bryan Marker: Some tests are expected to fail because they require SendRecv, which isn’t in the old code you have.
![Page 162: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/162.jpg)
Fighting the War on Parallel Programming Error Distributed Memory Parallel
What was required?
Replace MPI layer with Intel’s experimental RCCE communication layer (Bryan Marker).
Write a few collective communication routines for RCCE (Ernie Chan).
Important: Great confidence in the implementation.
Note: SuperMatrix port to SCC is almost complete.
We are eagerly waiting for performance results.
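The slides do not say which collectives were written or how. As a rough illustration of what building a collective on top of point-to-point transfers involves, the sketch below constructs a binomial-tree broadcast schedule and simulates it in a single address space. Everything here is an assumption made for illustration: `Transfer`, `BroadcastSchedule` and `SimulateBroadcast` are hypothetical names, and no RCCE or MPI calls are used.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One point-to-point transfer: (sender rank, receiver rank).
using Transfer = std::pair<int, int>;

// Builds a binomial-tree broadcast schedule rooted at rank 0 for p ranks.
// Each inner vector lists transfers that can proceed concurrently in one round:
// in the round with distance d, every rank below d sends to the rank d above it.
std::vector<std::vector<Transfer>> BroadcastSchedule(int p) {
    assert(p >= 1);
    std::vector<std::vector<Transfer>> rounds;
    for (int d = 1; d < p; d *= 2) {
        std::vector<Transfer> round;
        for (int src = 0; src < d && src + d < p; ++src)
            round.push_back({src, src + d});
        rounds.push_back(round);
    }
    return rounds;
}

// Simulates the schedule in one address space: buffers[r] stands in for the
// memory of rank r. On real hardware each Transfer would be a send on the
// source rank matched by a receive on the destination rank.
void SimulateBroadcast(std::vector<std::vector<double>>& buffers) {
    if (buffers.empty()) return;
    const int p = static_cast<int>(buffers.size());
    for (const auto& round : BroadcastSchedule(p))
        for (const auto& t : round)
            buffers[t.second] = buffers[t.first];
}
```

The schedule needs only blocking one-directional sends and receives, so it stays within a minimal point-to-point layer, and it finishes in a logarithmic number of rounds rather than the p - 1 steps of a naive loop over receivers.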
![Page 163: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/163.jpg)
Other Things I Could Talk About
1 Introduction
2 Notation
3 Deriving Algorithms to be Correct
4 From Correct Algorithm to Correct Code
5 Achieving High Performance
6 Fighting the War on Parallel Programming Error
Multithreaded Architectures
Distributed Memory Parallel
7 Other Things I Could Talk About
8 How Do I Get to Use All This?
9 Conclusion
![Page 164: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/164.jpg)
Other Things I Could Talk About
FLAME/C + GPU
SuperMatrix + Out-of-Core
SuperMatrix + GPU
SuperMatrix + MultiGPU
SuperMatrix + Out-of-Core + MultiGPU
PLAPACK + GPU
New algorithms for algorithms-by-blocks
Weapons of Math Induction for the War on Numerical Error Analysis
Weapons of Math Induction for iterative methods
Mechanical derivation of algorithms
Mechanical translation of FLAME/C code to lower level code
libflame, the library
![Page 172: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/172.jpg)
How Do I Get to Use All This?
Available as a Professionally Maintained Library
libflame Version 4.0 - Feb. 2010: http://www.cs.utexas.edu/users/flame/
Functionality that is a considerable subset of LAPACK
LAPACK compatibility layer
Linux and Windows OS
Field G. Van Zee. libflame: The Complete Reference. www.lulu.com, 2009
Elemental: http://code.google.com/p/elemental/ (soon to be incorporated in libflame)
![Page 174: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/174.jpg)
Conclusion
“How do we convince people that in programming simplicity and clarity – in short: what mathematicians call “elegance” – are not a dispensable luxury, but a crucial matter that decides between success and failure?”
– Dijkstra
![Page 175: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/175.jpg)
Conclusion
A Success Story
Practical application of “goal-oriented programming”
For the domain of dense linear algebra libraries, FLAME+SuperMatrix appears to solve the programmability problem for sequential and multicore
For the domain of distributed memory dense linear algebra libraries, Elemental appears to solve the programmability problem for clusters and many-core
http://www.cs.utexas.edu/users/flame/
![Page 176: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/176.jpg)
Conclusion
What is next?
My favorite definition of science: “Knowledge that has been reduced to a system”
How can one represent knowledge about linear algebra algorithms?
How can one systematically perform architecture-specific transformations with this knowledge?
Don’t code the library. Encode the expert knowledge.
![Page 181: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/181.jpg)
Conclusion
Want to learn more?
http://www.cs.utexas.edu/users/flame/publications
![Page 182: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/182.jpg)
Conclusion
Questions?
![Page 183: Weapons of Math Induction for the War on Parallel ... · Robert van de Geijn Field Van Zee Graduate Students Bryan Marker Kyungjoo Kim Isaac Lee Ardavan Pedram Jack Poulson Martin](https://reader033.vdocument.in/reader033/viewer/2022052014/602b0becc36c37454175ae96/html5/thumbnails/183.jpg)
Conclusion
Will this be embraced?
Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.
– Dijkstra