TRANSCRIPT
High Performance Matrix Computations
J.-Y. L'Excellent (INRIA/LIP-ENS Lyon), [email protected]
In collaboration with P. Amestoy, M. Daydé, L. Giraud (ENSEEIHT-IRIT)
2007-2008
Outline
- Introduction
  - Introduction to high-performance computers
  - Architectural evolution
  - Programming
  - Matrix computation and high-performance computing
  - Grid computing - Internet computing
  - Conclusion
- Introduction
  - Introduction to high-performance computers
  - Architectural evolution
  - Programming
  - Matrix computation and high-performance computing
  - Grid computing - Internet computing
  - Conclusion
- Benefits of high-performance computing:
  - time-critical applications
  - larger computations
  - reduced response time
  - lower computing costs
- Difficulties:
  - data access: complex memory hierarchy → exploit the locality of data references
  - identifying and managing the parallelism in an application → algorithmic approach
Parallel systems: finally coming of age!

- The most powerful machines are highly parallel
- The price/performance ratio is attractive
- Only a few manufacturers remain in the race
- Systems are more stable
- Application software and libraries are available
- Industrial and commercial use: no longer limited to research laboratories
- But: substantial algorithmic work, and difficult validation/maintenance.

New developments:

- 1 core per chip → multi-core chips
- supercomputing → metacomputing ("grid computing")
Classes of computers

- Compute servers:
  - usable on a wide range of applications
  - multiprogramming and time-sharing
  - workstations, departmental servers, computing centers
- More specific computers:
  - efficient on a more limited class of problems (high degree of parallelism), because of their architecture or software limitations
  - e.g. massively parallel architectures (MPP, PC clusters, ...)
  - large gains possible, with an attractive cost/performance ratio
- Specialized computers:
  - solve a single problem (image processing, crash tests, ...)
  - hardware and software designed for that target application
  - very large gains possible, with a very attractive cost/performance ratio
  - e.g. the MDGRAPE-3 machine (molecular dynamics) installed in Japan reaches 1 PFlop/s!
Needs in scientific computing

Traditional science:

1. Build a theory,
2. Carry out experiments or build a system.

This may be:

- too difficult (e.g. large wind tunnels)
- too expensive (building an airplane just for a few experiments)
- too slow (waiting for the climate / the universe to evolve)
- too dangerous (weapons, drugs, experiments on the climate)

Scientific computing:

- simulate the behavior of complex systems through numerical simulation
- physical laws + numerical algorithms + high-performance computers
Examples in scientific computing

- Time constraints: climate forecasting
A few examples in scientific computing

- Cost constraints: wind tunnels, crash simulation, ...
Scale Constraints
- large scale: climate modelling, pollution, astrophysics
- tiny scale: combustion, quantum chemistry
Why parallel processing?

- Computing needs unmet in many disciplines (to solve significant problems)
- Uniprocessor performance is close to its physical limits:
  a cycle time of 0.5 nanosecond ↔ 4 GFlop/s (with 2 floating-point operations per cycle)
- A 20 TFlop/s computer ⇒ 5000 processors → massively parallel computers
- Not because it is the easiest way, but because it is necessary
- Current goal (2010): a 3 PFlop/s supercomputer with 500 TBytes of memory?
Some units for high-performance computing

Speed:

| Unit | Name | Operations per second |
|---|---|---|
| 1 MFlop/s | 1 Megaflop/s | 10^6 |
| 1 GFlop/s | 1 Gigaflop/s | 10^9 |
| 1 TFlop/s | 1 Teraflop/s | 10^12 |
| 1 PFlop/s | 1 Petaflop/s | 10^15 |

Memory:

| Unit | Name | Bytes |
|---|---|---|
| 1 kB / 1 ko | 1 kilobyte | 10^3 |
| 1 MB / 1 Mo | 1 Megabyte | 10^6 |
| 1 GB / 1 Go | 1 Gigabyte | 10^9 |
| 1 TB / 1 To | 1 Terabyte | 10^12 |
| 1 PB / 1 Po | 1 Petabyte | 10^15 |
Performance measures

- Number of floating-point operations per second (not MIPS)
- Peak performance:
  - what appears in the manufacturers' advertising
  - assumes that all processing units are active
  - you are guaranteed not to go faster:
    peak performance = (# functional units) / (cycle time, in seconds)
- Actual performance:
  - usually much lower than peak performance, unfortunately
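The peak-performance formula above can be checked in a few lines (a minimal sketch; the unit count and cycle time below are the illustrative values used elsewhere in these notes):

```python
def peak_flops(n_functional_units, cycle_time_s):
    """Peak performance: assumes every functional unit completes one
    floating-point operation per cycle -- an upper bound only."""
    return n_functional_units / cycle_time_s

# 2 floating-point units at a 0.5 ns cycle time -> 4 GFlop/s,
# the figure quoted earlier for a 0.5 ns processor
print(peak_flops(2, 0.5e-9) / 1e9, "GFlop/s")
```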
The ratio (actual performance / peak performance) is often low!!

Consider a program P:

1. Sequential processor:
   - 1 scalar unit (1 GFlop/s)
   - execution time of P: 100 s
2. Parallel machine with 100 processors:
   - each processor: 1 GFlop/s
   - peak performance: 100 GFlop/s
3. If P is 10% sequential code + 90% parallelized code:
   - execution time of P: 0.9 + 10 = 10.9 s
   - actual performance: 9.2 GFlop/s
4. Actual performance / peak performance = 0.1
Amdahl's law

- fs: fraction of an application that cannot be parallelized
- fp = 1 - fs: parallelized fraction of the code
- N: number of processors
- Amdahl's law: tN ≥ (fp/N + fs) · t1 ≥ fs · t1
- Speed-up: S = t1/tN ≤ 1/(fs + fp/N) ≤ 1/fs

Figure: sequential and parallel execution times t1, t2, t3, ..., with the limit t∞ = fs · t1.
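Amdahl's law and the 100-processor example from the previous slide can be reproduced directly (a sketch; `t1 = 100 s` and `fs = 0.1` are the values of that example):

```python
def amdahl(fs, n_procs):
    """Upper bound on the speed-up for a sequential fraction fs
    of the work, run on n_procs processors."""
    fp = 1.0 - fs
    return 1.0 / (fs + fp / n_procs)

# Example from the slides: 10% sequential code, 100 processors
t1 = 100.0                      # sequential execution time (s)
tN = (0.1 + 0.9 / 100) * t1     # = 10.9 s
print(tN, amdahl(0.1, 100))     # speed-up is about 9.2, far below 100
# and no matter how many processors: speed-up <= 1/fs = 10
```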
| Computer | procs | LINPACK n = 100 | LINPACK n = 1000 | Peak perf. |
|---|---|---|---|---|
| Intel WoodCrest (1 core, 3 GHz) | 1 | 3018 | 6542 | 12000 |
| HP ProLiant (1 core, 3.8 GHz) | 1 | 1852 | 4851 | 7400 |
| HP ProLiant (1 core, 3.8 GHz) | 2 | | 8197 | 14800 |
| IBM eServer (1.9 GHz, Power5) | 1 | 1776 | 5872 | 7600 |
| IBM eServer (1.9 GHz, Power5) | 8 | | 34570 | 60800 |
| Fujitsu Intel Xeon (3.2 GHz) | 1 | 1679 | 3148 | 12800 |
| Fujitsu Intel Xeon (3.2 GHz) | 2 | | 5151 | 6400 |
| SGI Altix (1.5 GHz Itanium2) | 1 | 1659 | 5400 | 6000 |
| NEC SX-8 (2 GHz) | 1 | 2177 | 14960 | 16000 |
| Cray T932 | 32 | 1129 (1 proc.) | 29360 | 57600 |
| Hitachi S-3800/480 | 4 | 408 (1 proc.) | 20640 | 32000 |

Table: Performance (MFlop/s) on the solution of a linear system of equations (from the LINPACK Benchmark, Dongarra [07])
Figure: Grand challenge problems (1995) — memory requirements (10 MB to 1 TB) versus computing speed (100 MFlop/s to 1 TFlop/s, over the years 1980 to 1995 and beyond) for applications such as 2D airfoil, oil reservoir modelling, 48/72-hour weather modelling, 3D plasma modelling, chemical dynamics, vehicle signature, pharmaceutical design, structural biology, global change, human genome, fluid turbulence, vehicle dynamics, ocean circulation, viscous fluid dynamics, superconductor modelling, quantum chromodynamics, and vision.
| Machine | Small problem | Large problem |
|---|---|---|
| PFlop/s computer | - | 36 seconds |
| TFlop/s computer | 2 seconds | 10 hours |
| CM2 64K | 30 minutes | 1 year |
| CRAY-YMP-8 | 4 hours | 10 years |
| ALLIANT FX/80 | 5 days | 250 years |
| SUN 4/60 | 1 month | 1500 years |
| VAX 11/780 | 9 months | 14,000 years |
| IBM AT | 9 years | 170,000 years |
| APPLE MAC | 23 years | 450,000 years |

Table: Speed of various computers on a Grand Challenge problem in 1995 (after J. J. Dongarra)

Since then, "Grand Challenge" problems have grown!
- Introduction
  - Introduction to high-performance computers
  - Architectural evolution
  - Programming
  - Matrix computation and high-performance computing
  - Grid computing - Internet computing
  - Conclusion
Architectural evolution: a history

For $1,000, a personal computer today is faster, with more memory and more disk, than a $1,000,000 computer of the 1970s: technology and design!

- During the first 25 years of computing, progress came from both technology and architecture
- Since the 1970s:
  - design based on integrated circuits
  - performance: +25-30% per year for the mainframes and minis that dominated the industry
- Since the end of the 1970s: emergence of the microprocessor
  - better exploitation of advances in integration than mainframes and minis (which were less integrated)
  - progress plus a cost advantage (mass production): more and more machines are based on microprocessors
  - allowing a faster rate of improvement: 35% per year
Architectural evolution: a history

- Two changes in the market eased the introduction of new architectures:
  1. decreasing use of assembly language (binary compatibility less important)
  2. standard operating systems, independent of the architecture (e.g. UNIX)
  ⇒ development of a new family of architectures: RISC, from 1985 on
- Performance: +50% per year!!!
- Consequences:
  - more power: the performance of a PC exceeded that of a CRAY C90 (1995), at a far lower price
  - domination of microprocessors:
    - PCs, workstations
    - minis replaced by microprocessor-based servers
    - mainframes replaced by multiprocessors with a small number of RISC processors (SMP)
    - supercomputers based on RISC processors (essentially MPP)
![Page 23: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/23.jpg)
Moore's law

- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of integrated circuits would double every 24 months.
- The law has also served as a target for manufacturers.
- It has been distorted over time:
  - 24 months → 18 months
  - number of transistors → performance
How can computing speed be increased?

- Raise the clock frequency with faster technologies. We are reaching the limits:
  - chip design
  - electrical consumption and heat dissipation
  - cooling ⇒ space problems
- Further miniaturization is possible, but:
  - not indefinitely
  - the resistance of the conductors (R = ρ·l/s) increases, and resistance is responsible for energy dissipation (Joule effect)
  - capacitance effects are difficult to control

  Note: 1 nanosecond is the time a signal takes to travel 30 cm of cable
- A cycle time of 1 nanosecond ↔ 2 GFlop/s (with 2 floating-point operations per cycle)
The only solution: parallelism

- parallelism: simultaneous execution of several instructions within a program
- Inside a processor:
  - micro-instructions
  - pipelined processing
  - overlapping of instructions executed by distinct units
  → transparent to the programmer (handled by the compiler or at run time)
- Between distinct processors or cores:
  - different instruction streams are executed
  → synchronizations, either implicit (compiler, automatic parallelization) or explicit (user)
High-performance central processing units

Key concept: pipelined processing.

- The execution of an (arithmetic) operation is decomposed into several sub-operations
- Each sub-operation is executed by a dedicated functional unit = a stage (assembly-line processing)
- Example for a dyadic operation (a ← b × c):
  - T1. Separate mantissas and exponents
  - T2. Multiply the mantissas
  - T3. Add the exponents
  - T4. Normalize the result
  - T5. Attach the sign to the result
Example for dyadic operations (continued)

- Assumption: the operation a ← b × c is performed as 5 elementary steps T1, T2, ..., T5 of one cycle each. How many processor cycles does the following loop take?

  For i = 1 to N
      A(i) = B(i) * C(i)
  End For

- Non-pipelined processing: N * 5 cycles
- Pipelined (assembly-line) processing: N + 5 cycles
  - 1st cycle: T1(1)
  - 2nd cycle: T1(2), T2(1)
  - 3rd cycle: T1(3), T2(2), T3(1)
  - ...
  - k-th cycle: T1(k), T2(k-1), T3(k-2), T4(k-3), T5(k-4)
  - ...
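These cycle counts can be checked with a small simulator (a sketch; note that the exact pipelined count is N + 4 for a 5-stage pipeline, which the slide rounds to N + 5):

```python
from collections import deque

def pipeline_cycles(n_ops, n_stages):
    """Cycles needed for n_ops operations to flow through an
    n_stages pipeline, issuing one new operation per cycle."""
    in_flight = deque()          # (operation id, current stage)
    cycle = issued = retired = 0
    while retired < n_ops:
        cycle += 1
        # every in-flight operation advances one stage this cycle
        for i, (op, stage) in enumerate(in_flight):
            in_flight[i] = (op, stage + 1)
        # one new operation enters stage T1 per cycle
        if issued < n_ops:
            issued += 1
            in_flight.append((issued, 1))
        # the operation in the final stage retires at end of cycle
        if in_flight and in_flight[0][1] == n_stages:
            in_flight.popleft()
            retired += 1
    return cycle

N = 100
print(N * 5, "non-pipelined vs", pipeline_cycles(N, 5), "pipelined cycles")
```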
Impact of the CRAY approach

The CRAY approach (1980s) had a great impact on supercomputer design:

- the fastest possible clock
- a sophisticated pipelined vector unit
- vector registers
- very high performance memory
- shared-memory multiprocessors
- vector processors:
  - exploit the regularity of operations on the elements of a vector
  - pipelined processing
  - commonly used in supercomputers
  - vectorization performed by the compiler
RISC processors

- RISC processors: introduced on the market around 1990 ("the attack of the killer micros")
  - pipelining of scalar operations
  - performance close to that of vector processors at equal frequency
  - more efficient on scalar problems
- CISC (Complex Instruction Set Computer):
  - efficiency through better encoding of the instructions
- RISC (Reduced Instruction Set Computer):
  - concept studied at the end of the 1970s
  - reduce the number of cycles per instruction to 1

  Simple instruction set
  ↓
  Simplified hardware
  ↓
  Shorter cycle time
- Key ideas in the design of RISC processors:
  - instructions decoded in 1 cycle
  - only the essentials implemented in hardware
  - load/store interface to memory
  - intensive use of pipelining to obtain one result per cycle, even for complex operations
  - high-performance memory hierarchy
  - simple instruction format
  - superscalar or superpipelined RISC: several functional units
| Computer | procs | LINPACK n = 100 | LINPACK n = 1000 | Peak performance |
|---|---|---|---|---|
| Intel WoodCrest (1 core, 3 GHz) | 1 | 3018 | 6542 | 12000 |
| HP ProLiant (1 core, 3.8 GHz) | 1 | 1852 | 4851 | 7400 |
| IBM eServer (1.9 GHz, Power5) | 1 | 1776 | 5872 | 7600 |
| SGI Altix (1.6 GHz Itanium2) | 1 | 1765 | 5943 | 6400 |
| AMD Opteron (2.19 GHz) | 1 | 1253 | 3145 | 4284 |
| Fujitsu Intel Xeon (3.2 GHz) | 1 | 1679 | 3148 | 12800 |
| AMD Athlon (1 GHz) | 1 | 832 | 1705 | 3060 |
| Compaq ES45 (1 GHz) | 1 | 824 | 1542 | 2000 |
| Current performance of a vector processor: | | | | |
| NEC SX-8 (2 GHz) | 1 | 2177 | 14960 | 16000 |
| NEC SX-8 (2 GHz) | 8 | | 75140 | 128000 |

Table: Performance of RISC processors (MFlop/s, LINPACK Benchmark, Dongarra [07])
Multi-core architectures

Observations:

- The number of components per chip will keep increasing
- The clock frequency can no longer increase much (heat/cooling)
- It is difficult to find enough parallelism in the instruction stream of a single process

Multi-core:

- several cores inside a single processor
- seen by the user as several logical processors
- But: multi-threading is then necessary at the application level
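The last point — the application itself must be multi-threaded to use the cores — can be sketched as follows (purely illustrative Python; an HPC code would typically use OpenMP or similar, and the thread count and chunking here are arbitrary choices):

```python
from concurrent.futures import ThreadPoolExecutor

def dot_chunk(args):
    """Partial dot product over one chunk of the two vectors."""
    b, c, lo, hi = args
    return sum(b[i] * c[i] for i in range(lo, hi))

def parallel_dot(b, c, n_threads=4):
    """Split the work into one chunk per thread ('per core')."""
    n = len(b)
    bounds = [(i * n // n_threads, (i + 1) * n // n_threads)
              for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(dot_chunk, [(b, c, lo, hi) for lo, hi in bounds])
    return sum(partials)

b = list(range(1000))
c = [2.0] * 1000
print(parallel_dot(b, c))   # same result as the sequential dot product
```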
The Cell processor

- The PS3 is based on a Cell processor (Sony, Toshiba, IBM)
- 1 Cell = one PowerPC + 8 SPEs (Synergistic Processing Elements)
- 1 SPE = SIMD vector processor + DMA = 25.6 GFlop/s
- 204 GFlop/s of peak performance in 32-bit arithmetic (14.6 GFlop/s in 64-bit)
- Hence a renewed interest in 32-bit computation:
  - mixing single- and double-precision arithmetic (see [?])
  - typically: 32-bit for the bulk of the computation, 64-bit to improve the accuracy
  - not only on the Cell processor
Example of mixed-precision arithmetic

- Solve Ax = b, with A sparse, using the sparse direct solver MUMPS
- Compare single precision + iterative refinement against a double-precision run (the number of iterative refinement steps is indicated on the figure).

Speed-up obtained with respect to double precision (results from A. Buttari et al., 2007)
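The idea behind these results can be sketched with a dense NumPy analogue (MUMPS is a sparse LU solver; here `np.linalg.solve` stands in for the cheap single-precision factorization, so this is only a sketch of single-precision solves plus double-precision iterative refinement, not the actual method of the paper):

```python
import numpy as np

def mixed_precision_solve(A, b, tol=1e-12, max_steps=10):
    """Solve Ax = b: cheap solves in float32, residuals and
    corrections accumulated in float64 (iterative refinement)."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(max_steps):
        r = b - A @ x                       # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        dx = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += dx
    return x

rng = np.random.default_rng(0)
n = 50
A = rng.random((n, n)) + n * np.eye(n)      # well-conditioned test matrix
b = rng.random(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))  # small: refined to double accuracy
```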
| Years | Computer | Performance |
|---|---|---|
| 1955-65 | CDC 6600 | 1-10 MFlop/s |
| 1965-75 | CDC 7600, IBM 370/195, ILLIAC IV | 10-100 MFlop/s |
| 1975-85 | CRAY-1, XMP, CRAY 2, CDC CYBER 205, FUJITSU VP400, NEC SX-2 | 100-1000 MFlop/s |
| 1985-1995 | CRAY-YMP, C90, ETA-10, NEC SX-3, FUJITSU VP2600 | 1000-100,000 MFlop/s |
| 1995-2005 | CRAY T3E | 1.2 TFlop/s |
| | INTEL | 1.8 TFlop/s |
| | IBM SP | 16 TFlop/s |
| | HP | 20 TFlop/s |
| | NEC | 40 TFlop/s |
| | IBM Blue Gene | 180 TFlop/s |
| 2008- | Roadrunner | 1 PFlop/s |

Table: Evolution of performance by decade
Problems

- In practice, one often runs at only 10% of peak performance
- Faster processors → faster data access is needed:
  - memory organization,
  - inter-processor communication
- More complex hardware: pipelines, technology, network, ...
- More complex software: compiler, operating system, programming languages, parallelism management, ... and applications

It is becoming harder to program efficiently.
Memory speed vs processor speed

- Processor performance: +60% per year
- DRAM memory: +9% per year

→ The ratio (processor performance) / (memory access time) increases by about 50% per year!!
Memory bandwidth problems

- Data access is a crucial problem in modern computers
- Increasing computing speed without increasing memory bandwidth → bottleneck
  (MFlop/s are easier to obtain than MB/s of memory bandwidth)
- Processor cycle time → 2 GHz (0.5 ns)
  Memory cycle time → ≈ 20 ns for SRAM, ≈ 50 ns for DRAM
How can high memory bandwidth be obtained?

- Several access paths between memory and processors:
  - CRAY XMP and YMP: 2 vector loads + 1 vector store + 1 I/O, used to access distinct vectors
  - NEC SX: multiple access paths can also be used to load a single vector (improves bandwidth, but not latency!)
- Several memory modules accessed simultaneously (interleaving)
- Pipelined memory accesses
- Hierarchically organized memory
- The way data is accessed can affect performance:
  - minimize cache misses
  - minimize memory paging
  - locality: improve the ratio of references to local memory over references to remote memory
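The effect of access order on locality can be sketched in NumPy (illustrative only; absolute timings depend entirely on the machine, the point is the contrast between a unit-stride and a strided traversal of the same data):

```python
import time
import numpy as np

n = 2000
a = np.random.rand(n, n)   # C (row-major) layout: rows are contiguous in memory

def timed(f):
    t0 = time.perf_counter()
    result = f()
    return result, time.perf_counter() - t0

# unit-stride traversal: consecutive elements of each contiguous row
s_rows, t_rows = timed(lambda: sum(float(row.sum()) for row in a))
# stride-n traversal: each "row" of a.T is a column of a, stride n elements
s_cols, t_cols = timed(lambda: sum(float(col.sum()) for col in a.T))

# both passes visit the same n*n elements; on most machines the
# unit-stride pass is faster, thanks to cache-line reuse
print(t_rows, t_cols)
```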
| Level | Size | Average access time (# cycles), hit/miss |
|---|---|---|
| Registers | | < 1 |
| Cache level #1 | 1-128 KB | 1-2 / 8-66 |
| Cache level #2 | 256 KB-16 MB | 6-15 / 30-200 |
| Main memory | 1-10 GB | 10-100 |
| Remote memory | | 500-5000 |
| Disks | | 700,000 / 6,000,000 |

Figure: Example of a memory hierarchy.
Memory design for a large number of processors?

How can 100 processors access data stored in a shared memory (technology, interconnect, price)?
→ The affordable solution: physically distributed memory (each processor has its own local memory).

- 2 solutions:
  - globally addressable local memories: virtual shared memory computers
  - explicit data transfers between processors by message passing
- Scalability requires:
  - memory bandwidth increasing linearly with processor speed
  - communication bandwidth increasing with the number of processors
- Cost/performance ratio → distributed memory and a good cost/performance ratio for the processors
Multiprocessor architectures

A high number of processors → physically distributed memory

| Logical organization | Physically shared (32 procs max) | Physically distributed |
|---|---|---|
| Shared | shared-memory multiprocessors | global address space (hard/soft) on top of messages; virtual shared memory |
| Distributed | emulation of messages (buffers) | message passing |

Table: Organization of the processors

Note on programming standards:

- logically shared organization: threads, OpenMP
- logically distributed organization: PVM, MPI, sockets
Figure: Example of a shared-memory architecture — processors P1, P2, ..., Pn access a shared memory through an interconnection network.
Figure: Example of a distributed-memory architecture — processors P1, P2, ..., Pn, each with its own local memory (LM), communicate through an interconnection network.
Remarks

Physically shared memory:

- uniform access time to the whole memory

Physically distributed memory:

- access time depends on the location of the data

Logically shared memory:

- a single address space
- implicit communication between the processors through the shared memory

Logically distributed memory:

- several private address spaces
- explicit communication (messages)
Terminology

SMP (Symmetric MultiProcessor) architecture:

- memory shared both physically and logically
- identical access time to memory
- similar, from the application's point of view, to multi-core architectures (1 core = 1 logical processor)
- but communication is much faster within multi-cores (latency < 3 ns, bandwidth > 20 GB/s) than within SMPs (latency ≈ 60 ns, bandwidth ≈ 2 GB/s)

NUMA (Non-Uniform Memory Access) architecture:

- memory physically distributed and logically shared
- easier to increase the number of processors than with SMP
- access time depends on the location of the data
- local accesses are faster than remote accesses
- hardware maintains cache coherence (ccNUMA)
Examples

- Physically and logically shared memory (SMP): most supercomputers with a small number of processors: multiprocessor workstations (SUN: up to 64 processors, ...), NEC, SGI Power Challenge, ...
- Physically and logically distributed memory: clusters of uniprocessor PCs, IBM SP2, T3D, T3E, ...
- Physically distributed, logically shared memory (NUMA): BBN, KSR, SGI Origin, SGI Altix, ...
Clusters of multiprocessors

- Several levels of memory and of interconnection networks → non-uniform access times
- A common memory shared by a small number of processors (SMP node)
- Possibly distinct programming tools (message transfers between clusters, ...)
- Examples:
  - clusters of dual- or quad-processor machines,
  - IBM SP (CINES, IDRIS): several nodes of 4 to 32 Power4+ ...
Figure: Example of a "clustered" architecture — SMP nodes (several processors sharing a node memory) and processors with local memories (LM), linked by interconnection networks.
Networks of computers

- Evolution from centralized computing toward computing distributed over networks of computers:
  - growing power of workstations
  - attractive in terms of cost
  - identical processors in workstations and MPPs
- Parallel computing and distributed computing may converge:
  - programming model
  - software environment: PVM, MPI, ...
- Effective performance can vary enormously from one application to another
- Heterogeneous / homogeneous
- Rather oriented toward coarse-grain parallelism (independent tasks, ...)
- Performance highly dependent on communication (bandwidth and latency)
- Network and computer load varies during execution: how to balance the work?
![Page 51: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/51.jpg)
Figure: Example of a network of computers — individual computers, a cluster, and a multiprocessor connected through two networks.
Multiprocessors vs networks of machines

- Distributed systems (networks of machines): relatively slow communication and independent systems
- Parallel systems (multiprocessor architectures): faster communication (faster interconnection network) and more homogeneous systems

These two classes of architectures are converging and the boundary between them is blurred:

- clusters and clusters of clusters
- distributed operating systems (e.g. MACH and CHORUS OS) can manage both
- multiprocessor versions of UNIX
- often the same development environments
50/ 627
![Page 53: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/53.jpg)
Introduction
Introduction to high-performance computers
Architectural evolutions
Programming
Matrix computations and high-performance computing
Grid computing - Internet Computing
Conclusion
51/ 627
![Page 54: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/54.jpg)
Flynn's Classification

I S.I.S.D. : Single Instruction Single Data stream
  I uniprocessor architecture
  I conventional von Neumann computer
  I examples: SUN, PC
I S.I.M.D. : Single Instruction Multiple Data stream
  I processors synchronously execute the same instruction on different data (e.g. elements of a vector, a matrix, an image)
  I a control unit broadcasts the instructions
  I identical processors
  I Examples: CM-2, MasPar, . . .
  I more recently: each of the 8 SPEs of the CELL processor behaves as a SIMD system
52/ 627
![Page 55: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/55.jpg)
I M.I.S.D. : does not exist
I M.I.M.D. : Multiple Instructions Multiple Data stream
  I processors asynchronously execute different instructions on different data
  I possibly heterogeneous processors
  I each processor has its own control unit
  I examples: ALLIANT, CONVEX, CRAYs, IBM SP, BEOWULF clusters, multiprocessor servers, networks of workstations, . . .
53/ 627
![Page 56: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/56.jpg)
SIMD and MIMD programming models

I Advantages of SIMD:
  I Ease of programming and debugging
  I Synchronized processors → minimal synchronization costs
  I A single copy of the program
  I Simple instruction decoding
I Advantages of MIMD:
  I More flexible, much more general
  I Examples:
    I shared memory: OpenMP, POSIX threads
    I distributed memory: PVM, MPI (from C/C++/Fortran)
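As an illustration (not from the slides), the same element-wise addition can be written in a data-parallel, SIMD-like style (one operation applied synchronously to all data) and in a task-parallel, MIMD-like style (independent asynchronous tasks over chunks). This is only a sketch in Python, with NumPy standing in for the synchronous data-parallel model and a thread pool for asynchronous tasks:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

b = np.arange(8, dtype=float)
c = np.ones(8)

# SIMD-like: one "instruction" applied synchronously to all elements
a_simd = b + c

# MIMD-like: independent tasks, each processing its own chunk asynchronously
def add_chunk(lo, hi):
    return b[lo:hi] + c[lo:hi]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(add_chunk, 0, 4), pool.submit(add_chunk, 4, 8)]
    a_mimd = np.concatenate([f.result() for f in futures])

assert np.array_equal(a_simd, a_mimd)
```

Both styles compute the same result; what differs is who coordinates the work (a single control flow vs independent tasks).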
54/ 627
![Page 57: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/57.jpg)
Introduction
Introduction to high-performance computers
Architectural evolutions
Programming
Matrix computations and high-performance computing
Grid computing - Internet Computing
Conclusion
55/ 627
![Page 58: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/58.jpg)
Matrix computations and high-performance computing

I General approach in scientific computing:

  1. Simulation problem (continuous problem)
  2. Application of physical laws (partial differential equations)
  3. Discretization, formulation in finite dimension
  4. Solution of linear systems (Ax = b)
  5. (Analysis of the results, possible revision of the model or the method)

I Solving linear systems = a fundamental algorithmic kernel. Parameters to take into account:
  I Properties of the system (symmetry, positive definiteness, conditioning, over-determined, . . . )
  I Structure: dense or sparse,
  I Size: several million equations?
56/ 627
![Page 59: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/59.jpg)
Partial differential equations

I Modeling of a physical phenomenon

I Differential equations involving:
  I forces
  I moments
  I temperatures
  I velocities
  I energies
  I time

I Analytical solutions rarely available
57/ 627
![Page 60: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/60.jpg)
Examples of partial differential equations

I Find the electric potential for a given charge distribution:
∇²ϕ = f ⇔ ∆ϕ = f, i.e.

∂²ϕ/∂x² + ∂²ϕ/∂y² + ∂²ϕ/∂z² = f(x, y, z)

I Heat equation (or Fourier's equation):

∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = (1/α) ∂u/∂t

with
  I u = u(x, y, z, t): temperature,
  I α: thermal diffusivity of the medium.

I Wave propagation equations, Schrodinger equation, Navier-Stokes, . . .
58/ 627
![Page 61: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/61.jpg)
Discretization (the step following the physical modeling)

The numerical analyst's tasks:

I Construction of a mesh

I Choice of the solution methods and study of their behavior

I Study of the loss of information due to the passage to finite dimension

Main discretization techniques

I Finite differences

I Finite elements

I Finite volumes
59/ 627
![Page 62: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/62.jpg)
Discretization with finite differences (1D)
I Basic approximation (ok if h is small enough):

(du/dx)(x) ≈ (u(x + h) − u(x − h)) / (2h)

I Results from Taylor's formula

u(x + h) = u(x) + h du/dx + (h²/2) d²u/dx² + (h³/6) d³u/dx³ + O(h⁴)

I Replacing h by −h:

u(x − h) = u(x) − h du/dx + (h²/2) d²u/dx² − (h³/6) d³u/dx³ + O(h⁴)

I Thus:

d²u/dx² = (u(x + h) − 2u(x) + u(x − h)) / h² + O(h²)
60/ 627
![Page 63: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/63.jpg)
Discretization with finite differences (1D)
d²u/dx² = (u(x + h) − 2u(x) + u(x − h)) / h² + O(h²)

3-point stencil for the centered difference approximation to the second-order derivative:

[ 1  −2  1 ]
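As a quick check (not on the slides), the 3-point formula can be applied to a function with a known second derivative; the error should behave like O(h²), i.e. shrink roughly by a factor of 4 when h is halved:

```python
import math

def second_derivative(u, x, h):
    # 3-point centered difference: (u(x+h) - 2u(x) + u(x-h)) / h^2
    return (u(x + h) - 2 * u(x) + u(x - h)) / h**2

# For u = sin, u'' = -sin, so the error below should be small and O(h^2)
x = 1.0
err_h  = abs(second_derivative(math.sin, x, 1e-2) + math.sin(x))
err_h2 = abs(second_derivative(math.sin, x, 5e-3) + math.sin(x))
assert err_h2 < err_h < 1e-4   # halving h reduces the error
```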
61/ 627
![Page 64: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/64.jpg)
Finite Differences for the Laplacian Operator (2D)
Assuming the same mesh size h in the x and y directions,

∆u(x, y) ≈ (u(x − h, y) − 2u(x, y) + u(x + h, y)) / h² + (u(x, y − h) − 2u(x, y) + u(x, y + h)) / h²

∆u(x, y) ≈ (1/h²) ( u(x − h, y) + u(x + h, y) + u(x , y − h) + u(x , y + h) − 4u(x , y) )

5-point stencils for the centered difference approximation to the Laplacian operator: (left) standard, (right) skewed

       1                1     1
    1 −4  1                −4
       1                1     1
62/ 627
![Page 65: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/65.jpg)
27-point stencil used for 3D geophysical applications (collaboration S. Operto and J. Virieux, Geoazur).
![Page 66: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/66.jpg)
1D example
I Consider the problem

−u″(x) = f(x) for x ∈ (0, 1), with u(0) = u(1) = 0

I x_i = i × h, i = 0, . . . , n + 1, f(x_i) = f_i, u(x_i) = u_i,
h = 1/(n + 1)

I Centered difference approximation:

−u_{i−1} + 2u_i − u_{i+1} = h²f_i   (u_0 = u_{n+1} = 0),
I We obtain a linear system Au = f, or (for n = 6):

         ( 2 −1  0  0  0  0 ) (u1)   (f1)
         (−1  2 −1  0  0  0 ) (u2)   (f2)
(1/h²) × ( 0 −1  2 −1  0  0 ) (u3) = (f3)
         ( 0  0 −1  2 −1  0 ) (u4)   (f4)
         ( 0  0  0 −1  2 −1 ) (u5)   (f5)
         ( 0  0  0  0 −1  2 ) (u6)   (f6)
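The tridiagonal system above is easy to set up and solve numerically. A minimal NumPy sketch (the right-hand side f = π² sin(πx) is chosen here, not on the slides, so the exact solution u = sin(πx) is known):

```python
import numpy as np

n = 6
h = 1.0 / (n + 1)
x = np.linspace(h, n * h, n)          # interior points x_1 .. x_n

# Matrix of the 1D problem: (1/h^2) * tridiag(-1, 2, -1)
A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1)) / h**2

# Take f = pi^2 sin(pi x): the exact solution of -u'' = f, u(0)=u(1)=0
# is u = sin(pi x)
f = np.pi**2 * np.sin(np.pi * x)
u = np.linalg.solve(A, f)

# Second-order accurate: error is O(h^2)
assert np.allclose(u, np.sin(np.pi * x), atol=0.05)
```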
64/ 627
![Page 67: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/67.jpg)
Slightly more complicated (2D)
Consider an elliptic PDE:
−∂(a(x, y) ∂u/∂x)/∂x − ∂(b(x, y) ∂u/∂y)/∂y + c(x, y) × u = g(x, y) on Ω

u(x, y) = 0 on ∂Ω

0 ≤ x, y ≤ 1

a(x, y) > 0

b(x, y) > 0

c(x, y) ≥ 0
65/ 627
![Page 68: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/68.jpg)
I Case of a regular 2D mesh:

[Diagram: regular grid on the unit square (0 ≤ x, y ≤ 1), interior unknowns numbered 1, 2, 3, 4, 5, . . . ; discretization step: h = 1/(n+1), n = 4]

I 5-point finite difference scheme:

∂(a(x, y) ∂u/∂x)/∂x |_{ij} = ( a_{i+1/2,j}(u_{i+1,j} − u_{i,j}) − a_{i−1/2,j}(u_{i,j} − u_{i−1,j}) ) / h² + O(h²)

I Similarly:

∂(b(x, y) ∂u/∂y)/∂y |_{ij} = ( b_{i,j+1/2}(u_{i,j+1} − u_{i,j}) − b_{i,j−1/2}(u_{i,j} − u_{i,j−1}) ) / h² + O(h²)
![Page 69: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/69.jpg)
I a_{i±1/2,j}, b_{i,j±1/2}, c_{ij}, . . . known.

I With the ordering of unknowns of the example, we obtain a linear system of the form:

Ax = b,

I where

x_1 ↔ u_{1,1} = u(1/(n+1), 1/(n+1))
x_2 ↔ u_{2,1} = u(2/(n+1), 1/(n+1))
x_3 ↔ u_{3,1}
x_4 ↔ u_{4,1}
x_5 ↔ u_{1,2}, . . .

I and A is n² by n², b is of size n², with the following structure:
67/ 627
![Page 70: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/70.jpg)
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
   | x x     x                             |  1    |g11|
   | x x x     x                           |  2    |g21|
   |   x x x     x                         |  3    |g31|
   |     x x 0     x                       |  4    |g41|
   | x     0 x x     x                     |  5    |g12|
   |   x     x x x     x                   |  6    |g22|
   |     x     x x x     x                 |  7    |g32|
A= |       x     x x 0     x               |  8  b=|g42|
   |         x     0 x x     x             |  9    |g13|
   |           x     x x x     x           | 10    |g23|
   |             x     x x x     x         | 11    |g33|
   |               x     x x 0     x       | 12    |g43|
   |                 x     0 x x           | 13    |g14|
   |                   x     x x x         | 14    |g24|
   |                     x     x x x       | 15    |g34|
   |                       x     x x       | 16    |g44|
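One way (not shown on the slides) to reproduce this sparsity pattern is the Kronecker-product construction of the 5-point Laplacian matrix, here for constant coefficients a = b = 1, c = 0 and without the 1/h² scaling:

```python
import numpy as np

# Build the n^2 x n^2 matrix of the 2D 5-point scheme from a 1D
# tridiagonal piece via Kronecker products.
n = 4
T = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
I = np.eye(n)
A = np.kron(I, T) + np.kron(T, I)    # 16 x 16, same pattern as above

assert A.shape == (16, 16)
assert A[0, 0] == 4 and A[0, 1] == -1 and A[0, 4] == -1  # 5-point row
assert A[3, 4] == 0                  # the 0 at a grid-row (block) boundary
```

The zeros marked `0` on the slide appear exactly where two consecutively numbered unknowns sit on different grid rows.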
68/ 627
![Page 71: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/71.jpg)
Solution of the linear system
Often the most costly part of a numerical simulation code

I Direct methods:
  I LU factorization followed by triangular substitutions
  I parallelism depends highly on the structure of the matrix

I Iterative methods:
  I usually rely on sparse matrix-vector products (can be done in parallel)
  I an algebraic preconditioner is useful

→ Need for high-performance linear algebra kernels
69/ 627
![Page 72: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/72.jpg)
Evolution of a complex phenomenon
I Examples:
  I climate modeling, evolution of radioactive waste, . . .

I heat equation:

∆u(x, y, z, t) = ∂u(x, y, z, t)/∂t
u(x, y, z, t_0) = u_0(x, y, z)

I Discretization in both space and time (1D case):
  I Explicit approaches:
    (u_j^{n+1} − u_j^n) / (t^{n+1} − t^n) = (u_{j+1}^n − 2u_j^n + u_{j−1}^n) / h²
  I Implicit approaches:
    (u_j^{n+1} − u_j^n) / (t^{n+1} − t^n) = (u_{j+1}^{n+1} − 2u_j^{n+1} + u_{j−1}^{n+1}) / h²

I Implicit approaches are preferred (more stable, larger timesteps possible) but are more numerically intensive: a sparse linear system must be solved at each time step.
Need for high performance linear algebra kernels
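A minimal sketch of the implicit (backward Euler) scheme for the 1D heat equation: each time step solves (I − dt·L)u^{n+1} = u^n, where L is the discrete second derivative. The mesh and timestep values are illustrative, and a real code would use a sparse solver rather than the dense one used here for brevity:

```python
import numpy as np

n, h, dt = 50, 1.0 / 51, 1e-3
# Discrete 1D Laplacian: (1/h^2) * tridiag(1, -2, 1)
L = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / h**2
A = np.eye(n) - dt * L               # (I - dt*L), sparse in practice

x = np.linspace(h, n * h, n)
u = np.sin(np.pi * x)                # initial temperature profile

for _ in range(10):                  # ten implicit steps: one solve each
    u = np.linalg.solve(A, u)

# The exact solution decays like exp(-pi^2 t); so does the numerical one
assert 0.8 < u[n // 2] < 1.0
```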
70/ 627
![Page 73: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/73.jpg)
Discretization with Finite elements
I Consider a partial differential equation of the form (Poisson equation):

∆u = ∂²u/∂x² + ∂²u/∂y² = f
u = 0 on ∂Ω

I we can show (using Green's formula) that the previous problem is equivalent to:

a(u, v) = −∫_Ω f v dx dy   for all v such that v = 0 on ∂Ω

where a(u, v) = ∫_Ω ( (∂u/∂x)(∂v/∂x) + (∂u/∂y)(∂v/∂y) ) dx dy
71/ 627
![Page 74: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/74.jpg)
Finite element scheme: 1D Poisson Equation
I ∆u = ∂²u/∂x² = f, u = 0 on ∂Ω

I Equivalent to

a(u, v) = g(v) for all v (v|∂Ω = 0)

where a(u, v) = ∫_Ω (∂u/∂x)(∂v/∂x) and g(v) = −∫_Ω f(x)v(x)dx

(1D: similar to integration by parts)

I Idea: we search for u of the form u = Σ_k α_k Φ_k(x),

(Φ_k)_{k=1,n} a basis of functions such that Φ_k is linear on each E_i, and Φ_k(x_i) = δ_ik (= 1 if k = i, 0 otherwise).

[Diagram: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} over elements E_k, E_{k+1} around node x_k on Ω]
72/ 627
![Page 75: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/75.jpg)
Finite Element Scheme: 1D Poisson Equation
[Diagram: hat functions Φ_{k−1}, Φ_k, Φ_{k+1} over elements E_k, E_{k+1} around node x_k on Ω]

I We rewrite a(u, v) = g(v) for all Φ_k:
a(u, Φ_k) = g(Φ_k) for all k ⇔ Σ_i α_i a(Φ_i, Φ_k) = g(Φ_k)

a(Φ_i, Φ_k) = ∫_Ω (∂Φ_i/∂x)(∂Φ_k/∂x) = 0 when |i − k| ≥ 2

I k-th equation, associated with Φ_k:

α_{k−1} a(Φ_{k−1}, Φ_k) + α_k a(Φ_k, Φ_k) + α_{k+1} a(Φ_{k+1}, Φ_k) = g(Φ_k)

I a(Φ_{k−1}, Φ_k) = ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x)

a(Φ_{k+1}, Φ_k) = ∫_{E_{k+1}} (∂Φ_{k+1}/∂x)(∂Φ_k/∂x)

a(Φ_k, Φ_k) = ∫_{E_k} (∂Φ_k/∂x)² + ∫_{E_{k+1}} (∂Φ_k/∂x)²
73/ 627
![Page 76: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/76.jpg)
Finite Element Scheme: 1D Poisson Equation
From the point of view of E_k, we have a 2×2 contribution matrix:

( ∫_{E_k} (∂Φ_{k−1}/∂x)²            ∫_{E_k} (∂Φ_{k−1}/∂x)(∂Φ_k/∂x) )     ( I_{E_k}(Φ_{k−1}, Φ_{k−1})   I_{E_k}(Φ_{k−1}, Φ_k) )
( ∫_{E_k} (∂Φ_k/∂x)(∂Φ_{k−1}/∂x)    ∫_{E_k} (∂Φ_k/∂x)²             )  =  ( I_{E_k}(Φ_k, Φ_{k−1})       I_{E_k}(Φ_k, Φ_k)     )

[Diagram: nodes 0, 1, 2, 3, 4 on Ω, elements E_1 . . . E_4, basis functions Φ_1, Φ_2, Φ_3]

( I_{E_1}(Φ_1,Φ_1) + I_{E_2}(Φ_1,Φ_1)   I_{E_2}(Φ_1,Φ_2)                                             )   (α_1)   (g(Φ_1))
( I_{E_2}(Φ_2,Φ_1)                      I_{E_2}(Φ_2,Φ_2) + I_{E_3}(Φ_2,Φ_2)   I_{E_3}(Φ_2,Φ_3)       ) × (α_2) = (g(Φ_2))
(                                       I_{E_3}(Φ_3,Φ_2)                      I_{E_3}(Φ_3,Φ_3) + I_{E_4}(Φ_3,Φ_3) ) (α_3)   (g(Φ_3))
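On a uniform mesh the element integrals can be computed in closed form: each hat function has slope ±1/h on an element of length h, so every 2×2 element matrix is (1/h)[[1, −1], [−1, 1]]. The element-by-element assembly sketched above then looks like this (a minimal sketch, with the mesh size chosen for illustration):

```python
import numpy as np

# Assemble the 1D Poisson stiffness matrix from 2x2 element contributions.
# On a uniform mesh, element E_k of length h contributes
# (1/h) * [[1, -1], [-1, 1]] to the rows/columns of its two end nodes.
n_nodes = 5                          # nodes 0..4; nodes 1..3 are interior
h = 1.0 / (n_nodes - 1)
Ke = np.array([[1.0, -1.0], [-1.0, 1.0]]) / h

A = np.zeros((n_nodes, n_nodes))
for k in range(n_nodes - 1):         # loop over elements E_1..E_4
    A[k:k+2, k:k+2] += Ke            # scatter-add the element matrix

A_int = A[1:-1, 1:-1]                # keep interior nodes (u = 0 on boundary)
expected = (np.diag([2.0, 2.0, 2.0]) + np.diag([-1.0, -1.0], 1)
            + np.diag([-1.0, -1.0], -1)) / h
assert np.allclose(A_int, expected)
```

Note that the assembled interior matrix is (1/h)·tridiag(−1, 2, −1), the same pattern as the finite-difference matrix seen earlier.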
74/ 627
![Page 77: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/77.jpg)
Finite Element Scheme in Higher Dimension
I Can be used for higher dimensions
I Mesh can be irregular
I Φi can be a higher degree polynomial
I Matrix pattern depends on mesh connectivity/ordering
75/ 627
![Page 78: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/78.jpg)
Finite Element Scheme in Higher Dimension
I Set of elements (tetrahedra, triangles) to assemble:

[Diagram: triangle T with vertices i, j, k]

C(T) = ( a^T_{i,i}  a^T_{i,j}  a^T_{i,k} )
       ( a^T_{j,i}  a^T_{j,j}  a^T_{j,k} )
       ( a^T_{k,i}  a^T_{k,j}  a^T_{k,k} )

Needs for the parallel case

I Assembling the sparse matrix A = Σ_i C(T_i): graph coloring algorithms

I Parallelization domain by domain: graph partitioning

I Solution of Ax = b: high-performance matrix computation kernels
75/ 627
![Page 79: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/79.jpg)
Other example: linear least squares
I mathematical model + approximate measurements ⇒ estimate the parameters of the model

I m "experiments" + n parameters x_i:
min ‖Ax − b‖ with:
  I A ∈ R^{m×n}, m ≥ n: data matrix
  I b ∈ R^m: vector of observations
  I x ∈ R^n: parameters of the model

I Solving the problem:
  I Factorization of the form A = QR, with Q orthogonal, R triangular
  I ‖Ax − b‖ = ‖Q^T Ax − Q^T b‖ = ‖Q^T QRx − Q^T b‖ = ‖Rx − Q^T b‖

I Problems can be large (meteorological data, . . . ), sparse or not

→ Again, we need high-performance algorithms
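A small sketch of this QR-based least-squares solution with synthetic data (the matrix and parameter values are made up for illustration; noise-free observations are used so the recovered parameters match exactly):

```python
import numpy as np

# Least squares via QR: min ||Ax - b|| reduces to the triangular system
# R x = Q^T b once A = QR with Q having orthonormal columns
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))      # m = 8 experiments, n = 3 parameters
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                       # consistent observations (no noise)

Q, R = np.linalg.qr(A)               # reduced QR factorization
x = np.linalg.solve(R, Q.T @ b)      # solve the triangular system

assert np.allclose(x, x_true)
```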
76/ 627
![Page 80: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/80.jpg)
Software aspects, parallelization of industrial simulation codes

I Distinction between
  I Porting codes and optimizing them on SMP machines
    I Local changes in the code
    I No major change in the global resolution method
    I Possible substitution of computational kernels
  I Development of a parallel code for distributed memory machines → different algorithms needed
  I Development of optimized parallel libraries (ex: solvers for linear systems) where portability and efficiency are essential

I How to take the characteristics of a parallel machine into account?
  I Some of the most efficient sequential algorithms cannot be parallelized
  I Some algorithms that are suboptimal sequentially are very good in parallel

I Major problem: How to reuse existing codes?
77/ 627
![Page 81: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/81.jpg)
Introduction
Introduction to high-performance computers
Architectural evolutions
Programming
Matrix computations and high-performance computing
Grid computing - Internet Computing
Conclusion
78/ 627
![Page 82: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/82.jpg)
Grid computing - Internet Computing
Beyond providing access to information, the Internet can serve as a support for the execution of distributed applications.

Advantages
I Familiar interface
I Availability of basic tools:
  I Universal naming space (URL)
  I Protocol for transferring information (HTTP)
  I Management of information in standard formats (HTML-XML)
I Web = a primitive operating system for distributed applications?
79/ 627
![Page 83: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/83.jpg)
Grid computing - Internet Computing
Beyond providing access to information, the Internet can serve as a support for the execution of distributed applications.

Problems
I Where and how are the programs executed?
  I On the server site → CGI scripts, servlets, . . .
  I On the client site → scripts in a browser extension (plugin) or applets, . . .

I How is security ensured?
  I A major problem, not completely solved
    I Protection of the sites
    I Encryption of the information
    I Restrictions on the execution conditions
    I Traceability
  I But in the end, who benefits from the result of the execution of the service?
79/ 627
![Page 84: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/84.jpg)
Grid Computing
I Make resources on the Net transparently accessible: processing capacity, expert software, databases, . . .

3 types of grids: information sharing, storage, computing

I Problems:
  I Locating and returning the solutions - or software - in a form directly usable and familiar to the user
  I Examples: NetSolve, DIET, Globus, NEOS, Ninf, Legion, . . .

I Mechanisms involved:
  I Sockets, RPC, client-server, HTTP, Corba, CGI scripts, Java, . . .
  I Calls from C or Fortran codes
  I Possibly more interactive interfaces: Java consoles, HTML pages, . . .

I American and European initiatives: EuroGrid (CERN + ESA + . . . ), GRID 5000, . . .
80/ 627
![Page 85: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/85.jpg)
Computing grids: an attempt at classification (T. Priol, INRIA)

I Multiplicity of terms: P2P Computing, Metacomputing, Virtual Supercomputing, Desktop Grid, Pervasive Computing, Utility Computing, Mobile Computing, Internet Computing, PC Grid Computing, On Demand Computing, . . .

I Virtual Supercomputing: grids of supercomputers;

I Desktop Grid, Internet Computing: a grid composed of a very large number of PCs (10,000 - 1,000,000);

I Metacomputing: association of application servers;

I P2P Computing: peer-to-peer computing infrastructure:
each entity can alternately act as client or server.
81/ 627
![Page 86: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/86.jpg)
Vision of the "grid" in the USA.
82/ 627
![Page 87: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/87.jpg)
Peer-to-Peer : SETI@home
I 500,000 PCs searching for extraterrestrial intelligence

I Signal analysis

I A peer retrieves a data set from the Arecibo radio telescope

I Peers analyze the data (300 kB, 3 TFlops, 10 hours) when they are idle

I The results are sent back to the SETI team

I 35 TFlop/s on average

I A source of inspiration for many companies
83/ 627
![Page 88: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/88.jpg)
Peer-to-Peer : SETI@home
                   Total               Last 24 Hours
Users              5436301             0 new users
Results received   2005637370          780175
Total CPU time     2378563.061 years   539.796 years
Flops              7.406171e+21        3.042682e+18
83/ 627
![Page 89: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/89.jpg)
Google (after J. Dongarra)

I 2600 queries per second (200 × 10⁶ per day)

I 100 countries

I 8 × 10⁹ documents indexed

I 450,000 Linux systems in several data centers

I Electrical consumption of 20 MW ($2 million per month)

I Ranking of the pages ⇔ eigenvalue problem on a transition probability matrix (a 1 between pages i and j indicates the existence of a link from i to j)
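This ranking computation can be illustrated by power iteration on a small column-stochastic transition matrix (a toy link graph invented here, not Google's actual data): the ranking vector is the eigenvector associated with the dominant eigenvalue 1.

```python
import numpy as np

# Toy link graph: 1->2, 1->3, 2->3, 3->1.
# Column j holds the transition probabilities out of page j.
P = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

v = np.full(3, 1.0 / 3)              # start from the uniform distribution
for _ in range(100):
    v = P @ v                        # columns sum to 1: v stays a distribution

# v has converged to the stationary ranking vector: P v = v
assert np.allclose(P @ v, v, atol=1e-8)
```

For this graph the stationary vector is (0.4, 0.2, 0.4): pages 1 and 3 rank highest because they receive the most link mass.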
84/ 627
![Page 90: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/90.jpg)
RPC and Grid Computing: GridRPC (F. Desprez, INRIA)

I Simple idea:
  I Build the RPC programming model over the grid
  I use the resources (data + services) available on the network
  I Mixed parallelism: data-driven at the server level, task-driven between the servers.

I Required functionality:

  1. Load balancing (service location, performance evaluation, scheduling)
  2. IDL (Interface Definition Language)
  3. Mechanisms to manage data persistence and duplication.
  4. Security, fault tolerance, interoperability between middlewares (GridRPC)
85/ 627
![Page 91: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/91.jpg)
RPC and Grid Computing: GridRPC (continued)

I Examples:
  I NetSolve (Univ. Tennessee) (the oldest, based on sockets)
  I DIET: Graal team (LIP)
    A recent and widely used tool.
    Substantial work on task scheduling, deployment, and data management.
86/ 627
![Page 92: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/92.jpg)
Introduction
Introduction to high-performance computers
Architectural evolutions
Programming
Matrix computations and high-performance computing
Grid computing - Internet Computing
Conclusion
87/ 627
![Page 93: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/93.jpg)
Evolutions of High-Performance Computing

I Virtually shared memory:
  I clusters
  I A deeper memory hierarchy

I Clusters of machines
  I Often based on PCs (Pentium or Dec Alpha, NT or LINUX)

I Parallel programming (shared memory, message passing, data parallel):
  I Standardization efforts: OpenMP and POSIX threads, MPI, HPF, . . .

I MPPs and clusters
  I represent the future of high-performance computing
  I their ratio of communication speed to computing power is often low compared to shared-memory multiprocessors
  I increasingly integrated into the overall computing resources of companies
88/ 627
![Page 94: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/94.jpg)
Programming environments

I Parallel computing cannot be avoided
I Software always lags behind architectures:
  I Operating system
  I Automatic parallelization
  I Application software and scientific libraries

I For massively parallel architectures:
  I Programming standard: MPI, or MPI + threads (POSIX/OpenMP)
  I Languages: most often C or Fortran
  I Need for development tools (debuggers, compilers, performance analyzers, libraries, . . . )
  I Development and maintenance are difficult, and the tuning tools are hard to use.
89/ 627
![Page 95: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/95.jpg)
Solving problems arising from scientific computing

I Parallel computing is necessary to solve problems of "reasonable" size

I Computations involving matrices are often the most critical in memory and in time

I Needs: parallel numerical methods, dense and sparse linear algebra, graph algorithms

I The algorithms must adapt:
  I to parallel architectures
  I to programming models

I portability and efficiency?
  The best way to obtain a parallel program is to design a parallel algorithm!
90/ 627
![Page 96: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/96.jpg)
HPC Spectrum (after J. Dongarra)

[Spectrum, from distributed to massively parallel systems: Peer to peer (SETI@home) — Grid-based computing — Network of ws — Beowulf cluster — Clusters w/ special interconnect — Parallel dist mem — TFlop machines]

Distributed Systems
- Gather (unused) resources
- Steal cycles
- System software manages resources
- 10% - 20% overhead is OK
- Resources drive applications
- Completion time not critical
- Time-shared
- Heterogeneous

Massively // Systems
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources
- 5% overhead is maximum
- Apps drive purchase of equipment
- Real-time constraints
- Space-shared
- Homogeneous

91/ 627
![Page 97: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/97.jpg)
Outline
High-performance computers: general concepts
Introduction
Processor organization
Memory organization
Internal organization and performance of vector processors
Organization of RISC processors
Data reuse (in registers)
Cache memory
Data reuse (in caches)
Virtual memory
Data reuse (in memory)
Processor interconnection
The top 500 supercomputers in June 2007
Conclusion
92/ 627
![Page 98: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/98.jpg)
High-performance computers: general concepts
Introduction
Processor organization
Memory organization
Internal organization and performance of vector processors
Organization of RISC processors
Data reuse (in registers)
Cache memory
Data reuse (in caches)
Virtual memory
Data reuse (in memory)
Processor interconnection
The top 500 supercomputers in June 2007
Conclusion
93/ 627
![Page 99: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/99.jpg)
Introduction
I Designing a supercomputer
  I Determine which characteristics are important (application domain)
  I Maximum performance within the cost constraints (purchase, maintenance, power consumption)

I Designing a processor:
  I Instruction set
  I Functional and logical organization
  I Implementation (integration, power supply, . . . )

I Examples of functional constraints vs application domain
  I General-purpose machine: balanced performance over a wide range of workloads
  I Scientific computing: fast floating-point arithmetic
  I Business applications: databases, transaction processing, . . .
94/ 627
![Page 100: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/100.jpg)
I Usage of the architectures
  I Ever-growing memory requirements:
    ×1.5 - 2 per year for an average code (i.e. 1 address bit every 1-2 years)
  I Over the last 25 years, assembly language replaced by high-level languages → compilers / code optimization

I Technological evolution
  I Integrated circuits (CPU): density +50% per year
  I DRAM semiconductors (memory):
    I Density: +60% per year
    I Cycle time: −30% every 10 years
    I Size: ×4 every 3 years
95/ 627
![Page 101: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/101.jpg)
CPU performance
I CPUtime = #ProgramCycles / ClockRate

I #ProgramCycles = #ProgramInstructions × avg. #cycles per instruction

I Thus performance (CPUtime) depends on three factors:

  1. clock cycle time
  2. #cycles per instruction
  3. number of instructions

But those factors are inter-dependent:

I ClockRate depends on hardware technology and processor organization

I #cycles per instruction depends on organization and instruction set architecture

I #instructions depends on instruction set and compiler
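A worked example with illustrative numbers (not from the slides): a program executing 10⁹ instructions at an average of 1.5 cycles per instruction on a 2 GHz processor.

```python
# CPUtime = (#instructions * CPI) / ClockRate
instructions = 1e9
cpi = 1.5                   # average cycles per instruction
clock_rate = 2e9            # cycles per second (2 GHz)

program_cycles = instructions * cpi
cpu_time = program_cycles / clock_rate

assert cpu_time == 0.75     # seconds
```

Halving the CPI (better organization) or doubling the clock rate (better technology) would each halve the CPU time, which is why the three factors must be optimized together.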
96/ 627
![Page 102: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/102.jpg)
High-performance computers: general concepts
Introduction
Processor organization
Memory organization
Internal organization and performance of vector processors
Organization of RISC processors
Data reuse (in registers)
Cache memory
Data reuse (in caches)
Virtual memory
Data reuse (in memory)
Processor interconnection
The top 500 supercomputers in June 2007
Conclusion
97/ 627
![Page 103: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/103.jpg)
Pipeline

I Pipeline = the assembly-line principle
  I a computation is split into a number of sub-computations carried out by different units (the pipeline stages)
  I the stages operate simultaneously on different operands (elements of vectors, for example)
  I once the pipeline is full, one result is produced per base clock cycle

I RISC processor:
  I Pipelining of independent scalar operations:

    a = b + c
    d = e + f

  I Executable code is more complex on RISC:

    do i = 1, n
      a(i) = b(i) + c(i)
    enddo
98/ 627
![Page 104: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/104.jpg)
I Corresponding code:

    i = 1
    loop: load b(i) into register #1
          load c(i) into register #2
          register #3 = register #1 + register #2
          store register #3 into a(i)
          i = i + 1 and test for end of loop

I Exploiting the pipeline → loop unrolling:

    do i = 1, n, 4
      a(i  ) = b(i  ) + c(i  )
      a(i+1) = b(i+1) + c(i+1)
      a(i+2) = b(i+2) + c(i+2)
      a(i+3) = b(i+3) + c(i+3)
    enddo
99/ 627
![Page 105: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/105.jpg)
I On a vector processor:

do i = 1, n
  a(i) = b(i) + c(i)
enddo

load vector b into register #1
load vector c into register #2
register #3 = register #1 + register #2
store register #3 into vector a

Stripmining: if n > nb (the vector register length)

do i = 1, n, nb
  ib = min( nb, n-i+1 )
  do ii = i, i + ib - 1
    a(ii) = b(ii) + c(ii)
  enddo
enddo
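A sketch of the stripmining pattern above (NB is an assumed, illustrative vector-register length):

```python
# Sketch: stripmining — process a long vector in strips of at most NB
# elements, where NB stands in for the vector-register length.
NB = 64

def vadd_stripmined(b, c):
    n = len(b)
    a = [0.0] * n
    for i in range(0, n, NB):
        ib = min(NB, n - i)           # the last strip may be shorter
        for ii in range(i, i + ib):   # one strip fits the vector registers
            a[ii] = b[ii] + c[ii]
    return a

print(vadd_stripmined(list(range(100)), [1] * 100)[:5])  # [1, 2, 3, 4, 5]
```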
100/ 627
![Page 106: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/106.jpg)
Issues in pipeline design

I Many stages:
I higher startup cost
I performance more sensitive to the ability to keep the pipeline fed
I allows a shorter cycle time

I Fewer stages:
I more complex sub-instructions
I harder to reduce the cycle time
101/ 627
![Page 107: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/107.jpg)
The problem of data dependences

I Example:

do i = 2, n
  a(i) = a(i-1) + 1
enddo

with all a(i) initialized to 1.

I Scalar execution:

Step 1: a(2) = a(1) + 1 = 1 + 1 = 2
Step 2: a(3) = a(2) + 1 = 2 + 1 = 3
Step 3: a(4) = a(3) + 1 = 3 + 1 = 4
...
102/ 627
![Page 108: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/108.jpg)
I Vector execution: a pipeline with p stages → p elements in the pipeline

Stages of the pipe
-------------------------------------------------------
Time        1       2      3    ...    p     output
-------------------------------------------------------
t0         a(1)
t0 + dt    a(2)    a(1)
t0 + 2dt   a(3)    a(2)   a(1)
...
t0 + pdt   a(p+1)  a(p)   ...   a(2)   a(1)
-------------------------------------------------------

Hence:

a(2) = a(1) + 1 = 1 + 1 = 2
a(3) = a(2) + 1 = 1 + 1 = 2
...

because the initial value of a(2) is used.

Result of vector execution ≠ result of scalar execution
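A hedged sketch of why the two executions differ: the "vector-like" version below reads the old values of a (as a deep pipeline would, since a(i-1) is still in flight), while the scalar version reads each freshly updated element.

```python
def scalar_recurrence(n):
    a = [1] * n
    for i in range(1, n):
        a[i] = a[i - 1] + 1       # uses the value just written
    return a

def vector_like(n):
    a = [1] * n
    old = a[:]                    # all reads see the initial values,
    for i in range(1, n):         # as in a deep pipeline
        a[i] = old[i - 1] + 1
    return a

print(scalar_recurrence(5))  # [1, 2, 3, 4, 5]
print(vector_like(5))        # [1, 2, 2, 2, 2]
```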
103/ 627
![Page 109: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/109.jpg)
Overlapping

I Use functional units in parallel on independent operations. Example:

do i = 1, n
  A(i) = B(i) * C(i)
  D(i) = E(i) + F(i)
enddo

[Figure: B and C feed a pipelined multiplier producing A; E and F feed a pipelined adder producing D.]

Time_overlapping = max(Startup_mul, Startup_add) + dt + n × dt
Time_no overlap. = Startup_mul + n × dt + Startup_add + n × dt

I Advantages: parallelism between the independent functional units, and more flops per cycle
104/ 627
![Page 110: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/110.jpg)
Chaining

I The output of one functional unit is routed directly to the input of another functional unit
I Example:

do i = 1, n
  A(i) = ( B(i) * C(i) ) + D(i)
enddo

[Figure: B and C feed a pipelined multiplier whose output is chained, together with D, into a pipelined adder producing A.]

Time_chaining = Startup_mul + Startup_add + n × dt
Time_no chaining = Startup_mul + n × dt + Startup_add + n × dt

I Advantages: more flops per cycle, exploitation of data locality, savings in intermediate storage
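As a numerical sketch of the chaining timing model above, with assumed startup values (in cycles, dt = 1):

```python
# Sketch: evaluate the timing model for producing n results on pipelined
# units. The startup values are illustrative assumptions, not measured.
startup_mul, startup_add, dt, n = 7, 6, 1, 1000

t_no_chain = startup_mul + n * dt + startup_add + n * dt
t_chain    = startup_mul + startup_add + n * dt

print(t_no_chain)  # 2013
print(t_chain)     # 1013
```

For large n the chained version approaches one result per cycle for the whole fused multiply-add, roughly halving the time.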
105/ 627
![Page 111: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/111.jpg)
106/ 627
![Page 112: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/112.jpg)
Locality of references
Programs tend to reuse data and instructions they have used recently.

I A program often spends 90% of its time in only 10% of its code.
I The same applies - though not as strongly - to data accesses:
I temporal locality: recently accessed items are likely to be accessed again in the near future
I spatial locality: items whose addresses are near one another tend to be referenced close together in time.
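A toy sketch (illustrative parameters, not from the slides) showing why spatial locality matters: a small fully associative LRU cache sees far fewer misses on a sequential sweep than on a strided one.

```python
# Sketch: a toy fully associative LRU cache; line size and number of
# lines are illustrative assumptions.
from collections import OrderedDict

def misses(addresses, line_size=8, n_lines=4):
    cache = OrderedDict()               # line tag -> None, kept in LRU order
    m = 0
    for a in addresses:
        tag = a // line_size
        if tag in cache:
            cache.move_to_end(tag)      # hit: mark most recently used
        else:
            m += 1                      # miss: load the line
            cache[tag] = None
            if len(cache) > n_lines:
                cache.popitem(last=False)   # evict least recently used
    return m

print(misses(range(64)))            # sequential: 8 misses (one per line)
print(misses(range(0, 512, 8)))     # stride 8: 64 misses (new line each time)
```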
107/ 627
![Page 113: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/113.jpg)
Concept of memory hierarchy - 1
In hardware: smaller is faster.

Example:

I On a high-performance computer using the same technology (pipelining, overlapping, ...) throughout the memory system:
I signal propagation is a major cause of delay, so larger memories mean more signal delay and more levels to decode addresses
I smaller memories are faster because the designer can use more power per memory cell
108/ 627
![Page 114: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/114.jpg)
Concept of memory hierarchy - 2
Make use of the principle of locality of references:

I Data used most recently - or nearby data - are very likely to be accessed again in the future
I Try to keep recently accessed data in the fastest memory
I Because smaller is faster → use smaller memories to hold the most recently used items close to the CPU, and successively larger memories farther away from the CPU
→ Memory hierarchy
109/ 627
![Page 115: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/115.jpg)
Typical memory hierarchy
Level         Size     Access time   Bandwidth (MB/s)   Technology      Managed by
Registers     ≤ 1 KB   2-5 ns        400-32,000         (BI)CMOS        compiler
Cache         ≤ 4 MB   3-10 ns       800-5,000          CMOS SRAM       hardware
Main memory   ≤ 4 GB   80-400 ns     400-2,000          CMOS DRAM       OS
Disk          ≥ 1 GB   5 × 10^6 ns   4-32               magnetic disk   OS/user
110/ 627
![Page 116: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/116.jpg)
Speed of computation relies on high memory bandwidth.

[Figure: an adder computes X <- Y + Z from memory; both the data flow and the instruction flow impose bandwidth requirements.]

Bandwidth required = (NI × LI + 3 × NO) / cycle time (in words/sec), where:
NI = nb. of instructions/cycle
LI = nb. of words/instruction
NO = nb. of operations/cycle
1 word = 8 bytes

Example:

Machine            NI   LI    NO   Cycle time (ns)   Required bandwidth
Intel i860 XP      2    1/2   3/2  20                0.275 GW/sec
Digital α 21064    2    1/2   1    5                 0.6 GW/sec
1 CPU CRAY-C90     -    -     4    4.2               2.8 GW/sec
1 CPU NEC SX3/14   -    -     16   2.9               16 GW/sec
111/ 627
![Page 117: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/117.jpg)
Memory interleaving

"The memory is subdivided into several independent memory modules (banks)."

Example: a memory of size 2^10 = 1024 words divided into 8 banks, holding real a(256).

Two basic ways of distributing the addresses:

I High-order interleaving: bank 1 holds a(1), a(2), ..., a(128); bank 2 holds a(129), ..., a(256); banks 3-8 hold the rest of the address space.
I Low-order interleaving: bank 1 holds a(1), a(9), ..., a(249); bank 2 holds a(2), a(10), ..., a(250); bank 3 holds a(3), a(11), ..., a(251); ...; bank 7 holds a(7), ..., a(255); bank 8 holds a(8), a(16), ..., a(256). This scheme is "well adapted to pipelining memory access".
112/ 627
![Page 118: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/118.jpg)
Effect of bank cycle time

I Bank cycle time: time interval during which the bank cannot be referenced again
I Bank conflict: consecutive accesses to the same bank in less than the bank cycle time
I Stride: memory address interval between successive elements

Example: real a(4,2) in a low-order interleaved memory with 4 banks and a bank cycle time of 3 CP (1 CP per access).

Column access:

do j = 1, 2
  do i = 1, 4
    ... = a(i,j)
  enddo
enddo

→ successive accesses sweep banks 1, 2, 3, 4: 10 clock periods.

Row access:

do i = 1, 4
  do j = 1, 2
    ... = a(i,j)
  enddo
enddo

→ a(i,1) and a(i,2) fall in the same bank, so each pair conflicts: 18 clock periods.
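A sketch (illustrative parameters) of how the access stride determines which low-order-interleaved bank each element hits, and hence whether consecutive accesses conflict:

```python
# Sketch: low-order interleaving — element k of an array lives in bank
# k mod n_banks. With stride 1 the accesses rotate over all banks; with a
# stride that is a multiple of n_banks, every access hits the same bank.
def banks_touched(n_accesses, stride, n_banks=4):
    return [(i * stride) % n_banks for i in range(n_accesses)]

print(banks_touched(8, 1))  # [0, 1, 2, 3, 0, 1, 2, 3] - no conflicts
print(banks_touched(8, 4))  # [0, 0, 0, 0, 0, 0, 0, 0] - repeated bank conflicts
```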
113/ 627
![Page 119: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/119.jpg)
114/ 627
![Page 120: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/120.jpg)
Internal organization and performance of vector processors (after J. Dongarra)

I Consider the triadic vector operation:

do i = 1, n
  y(i) = alpha * ( x(i) + y(i) )
enddo

I It involves the following operations:

1. Load vector x
2. Load vector y
3. Add x + y
4. Multiply alpha × ( x + y )
5. Store into vector y
115/ 627
![Page 121: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/121.jpg)
I Processor organizations considered:

1. Sequential
2. Chained arithmetic
3. Chained memory loads and arithmetic
4. Chained memory loads, arithmetic and memory store
5. Overlapped memory loads with chained operations

I Notation:

a: startup for a memory load
b: startup for the addition
c: startup for the multiplication
d: startup for the memory store
116/ 627
![Page 122: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/122.jpg)
[Figure: timing diagrams. Sequential machine organization: load x (startup a), load y (a), add (b), mult (c), store (d) execute one after another; the memory path is busy during the loads and the store. Chained arithmetic: the add and the mult are chained, but the memory operations remain sequential.]
117/ 627
![Page 123: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/123.jpg)
[Figure: timing diagrams. Chained load and arithmetic: the two loads (startup a each) are chained with the add (b) and mult (c), followed by the store (d). Chained load, arithmetic and store: the store (d) is chained as well; the memory path is busy throughout.]
118/ 627
![Page 124: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/124.jpg)
[Figure: timing diagram. Overlapped load with chained operations: load x, load y and the store each use a separate memory path (paths 1, 2, 3), overlapped with the chained add (b) and mult (c).]
119/ 627
![Page 125: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/125.jpg)
120/ 627
![Page 126: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/126.jpg)
Organization of RISC processors
The execution pipeline:

Instruction Fetch → Instruction Decode → Execution → Memory access and branch completion → Write back (write results into the register file)

Example: DLX processor (Hennessy and Patterson, 1996)

I A pipeline increases the instruction throughput
I Pipeline hazards prevent the next instruction from executing:
I structural hazards: arising from hardware resource conflicts
I data hazards: due to dependencies between instructions
I control hazards: branches, for example
121/ 627
![Page 127: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/127.jpg)
Instruction Level Parallelism (ILP)
I Pipelining: overlap the execution of independent operations → Instruction Level Parallelism
I Techniques for increasing the amount of parallelism among instructions:
I reduce the impact of data and control hazards
I increase the ability of the processor to exploit parallelism
I compiler techniques to increase ILP
I Main techniques:
I loop unrolling
I basic and dynamic pipeline scheduling
I dynamic branch prediction
I issuing multiple instructions per cycle
I compiler dependence analysis
I software pipelining
I trace scheduling / speculation
I ...
122/ 627
![Page 128: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/128.jpg)
Instruction Level Parallelism (ILP)
I A simple and common way to increase the amount of parallelism is to exploit parallelism among the iterations of a loop: Loop Level Parallelism
I Several techniques:
I unrolling a loop, statically by the compiler or dynamically by the hardware
I use of vector instructions
123/ 627
![Page 129: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/129.jpg)
ILP: Dynamic scheduling
I The hardware rearranges instruction execution to reduce stalls
I Advantage: handles cases where dependences are unknown at compile time, and simplifies the compiler
I But: significant increase in hardware complexity
I Idea: execute instructions as soon as their data are available (out-of-order execution)
I Handling exceptions becomes tricky
124/ 627
![Page 130: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/130.jpg)
ILP: Dynamic scheduling
I Scoreboarding: a technique allowing out-of-order instruction execution when resources are sufficient and there are no data dependences
I the scoreboard takes full responsibility for instruction issue and execution
I goal: try to maintain an execution rate of one instruction/clock by executing instructions as early as possible
I requires multiple instructions to be in the EX stage simultaneously → multiple functional units and/or pipelined units
I the scoreboard table records/updates the data dependences and the status of the functional units
I Limits:
I amount of parallelism available between instructions
I number of scoreboard entries: the set of instructions examined (the window)
I number and type of functional units
125/ 627
![Page 131: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/131.jpg)
ILP: Dynamic scheduling
I Another approach: Tomasulo's approach (register renaming)
I Suppose the compiler has issued:

F10 <- F2 x F2
F2  <- F0 + F6

I Rename F2 to F8 in the second instruction (assuming F8 is not used):

F10 <- F2 x F2
F8  <- F0 + F6

I Can be used in conjunction with scoreboarding
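A toy sketch (not from the slides) of the renaming idea: rewrite the destination of a later instruction when it would overwrite a register that an earlier instruction still reads (a write-after-read hazard); the function and its instruction format are hypothetical.

```python
# Toy register renaming: each instruction is (dest, src1, src2). If an
# instruction writes a register that an earlier instruction reads (a WAR
# hazard), give it a fresh register and patch later readers via `mapping`.
def rename(instrs, free_regs):
    out, mapping = [], {}
    for i, (dest, s1, s2) in enumerate(instrs):
        s1, s2 = mapping.get(s1, s1), mapping.get(s2, s2)
        if any(dest in (a, b) for _, a, b in instrs[:i]):  # WAR hazard
            new = free_regs.pop()
            mapping[dest] = new      # later reads of `dest` now use `new`
            dest = new
        out.append((dest, s1, s2))
    return out

prog = [("F10", "F2", "F2"),   # F10 <- F2 x F2
        ("F2", "F0", "F6")]    # F2  <- F0 + F6
print(rename(prog, ["F8"]))    # [('F10', 'F2', 'F2'), ('F8', 'F0', 'F6')]
```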
126/ 627
![Page 132: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/132.jpg)
ILP: Multiple issue

I CPI cannot be less than one unless more than one instruction is issued each cycle → multiple-issue processors (CPI: average number of cycles per instruction)
I Two types:
I superscalar processors
I VLIW processors (Very Long Instruction Word)
I Superscalar processors issue a varying number of instructions per cycle, scheduled either statically by the compiler or dynamically (e.g. using scoreboarding). Typically 1-8 instructions per cycle, with some constraints.
I VLIW processors issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet: inherently statically scheduled by the compiler
127/ 627
![Page 133: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/133.jpg)
Impact of ILP: example

This example is from J.L. Hennessy and D.A. Patterson (1996).

I Original Fortran code:

do i = 1000, 1, -1
  x(i) = x(i) + temp
enddo

I Pseudo-assembler code:

      R1 <- address(x(1000))
      load temp -> F2
Loop: load x(i) -> F0
      F4 = F0 + F2
      store F4 -> x(i)
      R1 = R1 - #8       % decrement pointer
      BNEZ R1, Loop      % branch until end of loop
128/ 627
![Page 134: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/134.jpg)
I Architecture

[Figure: example of a pipelined processor (DLX processor, Hennessy and Patterson, 1996): stages IF, ID, EX, MEM, WB, with an integer unit (1 stage), an FP adder (4 stages), an FP multiplier (4 stages), and a divider (not pipelined).]
129/ 627
![Page 135: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/135.jpg)
I Latency: number of cycles between the instruction that produces a result and the instruction that uses it
I Initiation interval: number of cycles between the issue of two instructions of the same type
I Latency = 0 means the result can be used in the next cycle

Characteristics of the processor:

Functional unit   Latency   Initiation interval
Integer ALU       0         1
Loads             1         1
FP add            3         1
FP mult           3         1
FP divide         24        24

Latency between instructions:

Inst. producing result   Inst. using result   Latency
FP op                    FP op                3
FP op                    store double         2
Load double              FP op                1
Load double              store double         0

Latency FP op to store double: forwarding hardware passes the result from the ALU directly to the memory input.
130/ 627
![Page 136: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/136.jpg)
I Straightforward code:

                              #cycle
Loop: load x(i) -> F0         1     load lat. = 1
      stall                   2
      F4 = F0 + F2            3
      stall                   4     FP op -> store = 2
      stall                   5
      store F4 -> x(i)        6
      R1 = R1 - #8            7
      BNEZ R1, Loop           8
      stall                   9     delayed branch = 1

I 9 cycles per iteration
I Cost of the calculation: 9,000 cycles
I Peak performance: 1 flop/cycle
I Effective performance: 1/9 of peak
131/ 627
![Page 137: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/137.jpg)
I With better scheduling:

                              #cycle
Loop: load x(i) -> F0         1     load lat. = 1
      stall                   2
      F4 = F0 + F2            3
      R1 = R1 - #8            4     try to keep the integer unit busy
      BNEZ R1, Loop           5
      store F4 -> x(i)        6     hide the delayed branch behind the store

I 6 cycles per iteration
I Cost of the calculation: 6,000 cycles
I Effective performance: 1/6 of peak
132/ 627
![Page 138: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/138.jpg)
I Using loop unrolling (depth = 4):

do i = 1000, 1, -4
  x(i  ) = x(i  ) + temp
  x(i-1) = x(i-1) + temp
  x(i-2) = x(i-2) + temp
  x(i-3) = x(i-3) + temp
enddo
133/ 627
![Page 139: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/139.jpg)
I Pseudo-assembler code (loop unrolling, depth = 4):

                              #cycle
Loop: load x(i)   -> F0       1     1 stall
      F4 = F0 + F2            3     2 stalls
      store F4  -> x(i)       6
      load x(i-1) -> F6       7     1 stall
      F8 = F6 + F2            9     2 stalls
      store F8  -> x(i-1)     12
      load x(i-2) -> F10      13    1 stall
      F12 = F10 + F2          15    2 stalls
      store F12 -> x(i-2)     18
      load x(i-3) -> F14      19    1 stall
      F16 = F14 + F2          21    2 stalls
      store F16 -> x(i-3)     24
      R1 = R1 - #32           25
      BNEZ R1, Loop           26
      stall                   27

I 27 cycles per iteration
I Cost of the calculation: (1000/4) × 27 = 6750 cycles
I Effective performance: 1000/6750 = 15% of peak
134/ 627
![Page 140: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/140.jpg)
I Using loop unrolling (depth = 4) and scheduling:

                              #cycle
Loop: load x(i)   -> F0       1
      load x(i-1) -> F6       2
      load x(i-2) -> F10      3
      load x(i-3) -> F14      4
      F4  = F0  + F2          5
      F8  = F6  + F2          6
      F12 = F10 + F2          7
      F16 = F14 + F2          8
      store F4  -> x(i)       9
      store F8  -> x(i-1)     10
      store F12 -> x(i-2)     11
      R1 = R1 - #32           12
      BNEZ R1, Loop           13
      store F16 -> x(i-3)     14

I 14 cycles per iteration
I Cost of the calculation: (1000/4) × 14 = 3500 cycles
I Effective performance: 1000/3500 = 29% of peak
135/ 627
![Page 141: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/141.jpg)
I Now assume a superscalar pipeline: integer and floating-point operations can be issued simultaneously
I Using loop unrolling with depth = 5:

      Integer inst.     | Float. inst. | #cycle
      ___________________________________________
Loop: load x(i)  -> F0  |              | 1
      load x(i-1)-> F6  |              | 2
      load x(i-2)-> F10 | F4 =F0 +F2   | 3
      load x(i-3)-> F14 | F8 =F6 +F2   | 4
      load x(i-4)-> F18 | F12=F10+F2   | 5
      store F4 ->x(i)   | F16=F14+F2   | 6
      store F8 ->x(i-1) | F20=F18+F2   | 7
      store F12->x(i-2) |              | 8
      store F16->x(i-3) |              | 9
      R1 = R1 - #40     |              | 10
      BNEZ R1, Loop     |              | 11
      store F20->x(i-4) |              | 12

I 12 cycles per iteration
I Cost of the calculation: (1000/5) × 12 = 2400 cycles
I Effective performance: 1000/2400 = 42% of peak
I Performance limited by the balance between integer and floating-point instructions
136/ 627
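To summarize the arithmetic of the variants above, a small sketch (cycle counts taken from the slides; peak = 1 flop/cycle, 1000 flops total):

```python
# Recompute total cycles and fraction of peak for the 1000-iteration loop.
variants = {
    "naive":               (1, 9),    # (iterations per group, cycles/group)
    "scheduled":           (1, 6),
    "unrolled x4":         (4, 27),
    "unrolled x4 + sched": (4, 14),
    "superscalar x5":      (5, 12),
}
for name, (depth, cycles) in variants.items():
    total = 1000 // depth * cycles
    print(f"{name:20s} {total:5d} cycles  {1000 / total:.0%} of peak")
```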
![Page 142: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/142.jpg)
Overview of RISC processors

I Pipelined RISC processor: example of a 4-stage execution pipeline

[Figure: fetch (stage #1) → decode (stage #2) → execute (stage #3) → write result (stage #4).]
137/ 627
![Page 143: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/143.jpg)
Overview of RISC processors

I Superscalar RISC processor:
I several pipelines
I several instructions fetched + decoded + executed simultaneously

[Figure: four parallel pipelines, each with fetch (stage #1), decode (stage #2), execute (stage #3) and write-result (stage #4) stages.]

I often one integer operation / one floating-point operation / one memory load per cycle
I problem: dependences
I widely used: DEC, HP, IBM, Intel, SGI, Sparc, ...
137/ 627
![Page 144: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/144.jpg)
Overview of RISC processors

I Superpipelined RISC processor:
I more than one instruction initiated per cycle time
I pipeline clocked faster than the system clock
I example: on the MIPS R4000, the internal pipeline clock is 2 (or 4) times faster than the external system clock
I Superscalar + superpipelined: does not exist
137/ 627
![Page 145: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/145.jpg)
Examples of RISC processors

Processor      MHz    Exec. pipe   D-cache (KB)       I-cache (KB)   inst./cycle   Peak perf.
                      #stages
DEC 21064      200    7/9          8                  8              2             200
DEC 21164      437    7/9          8 + 96 + 4 MB      8              4             874
HP PA 7200     120    -            2 MB ext.                         2             240
HP PA 8000     180    -            2 MB ext.                         4             720
IBM Power      66     6            32-64              8-32           4             132
IBM Power2     71.5   6            128-256            8-32           6             286
MIPS R8000     75     5/7          16 + 4 MB ext.     16             4             300
MIPS R10000    195    -            32 + 4 MB ext.     32             4             390
MIPS R12000    300    -            32 + 8 MB ext.     32             4             600
Pentium Pro    200    -            512 ext.                                        200
UltraSPARC I   167    -            16 + 512 KB ext.   16             2             334
UltraSPARC II  200    -            1 MB ext.          16             2             400
138/ 627
![Page 146: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/146.jpg)
139/ 627
![Page 147: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/147.jpg)
Data reuse (in registers)

I Improve data access and exploit the spatial and temporal locality of memory references
I Loop unrolling: reduces the number of memory accesses by using as many registers as possible
I Use temporary scalars
I Loop distribution: if the number of reusable data > the number of registers, replace a single loop with several loops
140/ 627
![Page 148: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/148.jpg)
Loop unrolling

Objective: reduce the number of memory accesses and improve the pipelining of floating-point operations.

I Matrix-vector product: y ← y + A^t × x

do ...
  do ...
    y(i) = y(i) + x(j)*A(j,i)
  enddo
enddo

I 2 variants:
I AXPY:

do j = 1, N
  do i = 1, N
    ...

I DOT:

do i = 1, N
  do j = 1, N
    ...

141/ 627
![Page 149: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/149.jpg)
DOT variant

RISC processors are better suited to DOT than to AXPY:

do i = 1, N
  temp = 0.
  do j = 1, N
    temp = temp + x(j)*A(j,i)
  enddo
  y(i) = y(i) + temp
enddo

Stride = 1 in the innermost loop:

load A(j,i)
load x(j)
perform x(j)*A(j,i) + temp

Ratio flops/memory references = 2/2 = 1
142/ 627
![Page 150: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/150.jpg)
Reusing x(j): unrolling to depth 2

* Cleanup odd iteration
      i = MOD(N,2)
      if ( i >= 1 ) then
        do j = 1, N
          y(i) = y(i) + x(j)*A(j,i)
        enddo
      end if
* Main loop
      imin = i + 1
      do i = imin, N, 2
        temp1 = 0.
        temp2 = 0.
        do j = 1, N
          temp1 = temp1 + A( j,i-1) * x(j)
          temp2 = temp2 + A( j,i   ) * x(j)
        enddo
        y(i-1) = y(i-1) + temp1
        y(i  ) = y(i  ) + temp2
      enddo
143/ 627
![Page 151: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/151.jpg)
load A(j,i-1)
load x(j)
perform A(j,i-1) * x(j) + temp1
load A(j,i)
perform A(j,i) * x(j) + temp2

I Ratio flops/memory references = 4/3
I Unrolling to depth 4: ratio = 8/5
I Unrolling to depth k: ratio = 2k/(k+1)
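A small sketch checking the reference counting behind the 2k/(k+1) ratio: an inner iteration unrolled to depth k loads x(j) once, loads k elements of A, and performs 2k flops.

```python
from fractions import Fraction

# Flops vs. memory references per inner iteration of the DOT matvec,
# unrolled to depth k: 1 load of x(j), k loads of A(j, .), 2k flops.
def dot_ratio(k):
    flops = 2 * k
    mem_refs = k + 1
    return Fraction(flops, mem_refs)

print(dot_ratio(1), dot_ratio(2), dot_ratio(4))  # 1 4/3 8/5
```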
144/ 627
![Page 152: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/152.jpg)
[Figure: performance of y = A^t x on HP 715/64, in Mflops, for matrix sizes up to 1200; curves for the rolled loop and for unrolling depths 2, 4 and 8, ranging from about 5 to 45 Mflops.]

Figure: Effect of loop unrolling on HP 715/64
145/ 627
![Page 153: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/153.jpg)
[Figure: performance of y = A^t x on CRAY T3D, in Mflops, for matrix sizes up to 1200; curves for the rolled loop and for unrolling depths 2, 4 and 8, ranging from about 5 to 50 Mflops.]

Figure: Effect of loop unrolling on CRAY T3D
146/ 627
![Page 154: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/154.jpg)
AXPY variant

Usually preferred on vector processors:

do j = 1, N
  do i = 1, N
    y(i) = y(i) + x(j)*A(j,i)
  enddo
enddo

Stride > 1 in the innermost loop:

load A(j,i)
load y(i)
perform x(j)*A(j,i) + y(i)
store result in y(i)

Ratio flops/memory references = 2/3
147/ 627
![Page 155: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/155.jpg)
Reusing y(i): unrolling to depth 2

* Cleanup odd iteration
      j = MOD(N,2)
      if ( j .GE. 1 ) then
        do i = 1, N
          y(i) = y(i) + x(j)*A(j,i)
        enddo
      end if
* Main loop
      jmin = j + 1
      do j = jmin, N, 2
        do i = 1, N
          y(i) = y(i) + A(j-1,i)*x(j-1) + A(j,i)*x(j)
        enddo
      enddo
![Page 156: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/156.jpg)
load y(i)
load A(j-1,i)
perform A(j-1,i) * x(j-1) + y(i)
load A(j,i)
perform A(j,i) * x(j) + y(i)
store result in y(i)

I Ratio flops/memory references = 1
I Unrolling to depth 4 → ratio = 4/3
I Unrolling to depth p → ratio = 2p/(2+p)
149/ 627
![Page 157: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/157.jpg)
150/ 627
![Page 158: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/158.jpg)
Organization of a cache memory

I Cache
I fast buffer between the registers and main memory
I divided into cache lines
I Cache line
I unit of transfer between the cache and main memory
I Cache miss
I reference to a piece of data not present in the cache
I a strategy selects the line to replace (LRU among the eligible lines)
I a cache line containing the data is loaded from main memory into the cache
I Cache coherency problem on shared-memory multiprocessors
I Placement of data in the cache
I mapping between memory addresses and locations in the cache
151/ 627
![Page 159: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/159.jpg)
I Most common strategies:
I “direct mapping”
I “fully associative”
I “set associative”
I Cache design:
I L bytes per cache line
I K lines per set (K is the degree of associativity)
I N sets
I Simple mapping between a memory address and a set:
I N = 1: “fully associative” cache
I K = 1: “direct mapped” cache
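A sketch (with illustrative parameters) of the standard address decomposition this design implies: with L bytes per line and N sets, a line's set index is (address / L) mod N.

```python
# Sketch: map a byte address to (tag, set index, offset) for a cache with
# L bytes per line and N sets; the associativity K only determines how
# many lines can share a set, not the mapping itself.
L, N = 64, 128          # illustrative: 64-byte lines, 128 sets

def cache_slot(addr):
    offset = addr % L
    set_index = (addr // L) % N
    tag = addr // (L * N)
    return tag, set_index, offset

print(cache_slot(0))         # (0, 0, 0)
print(cache_slot(64))        # (0, 1, 0) - next line, next set
print(cache_slot(64 * 128))  # (1, 0, 0) - same set as address 0, new tag
```

Addresses L × N bytes apart land in the same set, which is where direct-mapped (K = 1) contention comes from.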
152/ 627
![Page 160: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/160.jpg)
I “Direct mapping”
I each block in memory ↔ a unique location in the cache
I cheap to look data up in the cache (but replacement is costly)
I contention problem between blocks

[Figure: each line of main memory maps to one fixed cache line.]

I “Fully associative”
I no a priori mapping
I costly to look data up in the cache
153/ 627
![Page 161: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/161.jpg)
I “Set associative”
I the cache is divided into several sets
I each block in memory can go into any of the lines of its set
I “4-way set associative”: 4 lines per set

[Figure: a line of main memory maps to one of lines 1-4 of cache set #k.]
154/ 627
![Page 162: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/162.jpg)
Cache management

- Cost of a cache miss: between 2 and 50 C (cycle times)
- "Copyback"
  - No memory update when a cache line is modified, except on a cache flush or a cache miss
  - Memory is not always up to date.
  - No coherency problem as long as processors modify independent cache lines
- "Writethrough"
  - A datum is written to memory every time it is modified
  - Memory is always up to date.
  - No coherency problem as long as processors modify independent data
155/ 627
![Page 163: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/163.jpg)
Cache coherency problem
[Figure: two processors with private caches holding copies X and Y of the same cache line]

- Cache coherency mechanisms are needed to:
  - avoid processors accessing stale copies of data (copyback and writethrough)
  - update memory by forcing a copyback
  - invalidate stale cache lines
- Example mechanism (snooping):
  - assume a writethrough policy
  - each processor observes the memory accesses of the others
  - if a write occurs that corresponds to a local cache line, invalidate that local cache line
156/ 627
![Page 165: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/165.jpg)
| Processor | Line size | Level | Size | Organization | Miss cost | Access/cycle |
|---|---|---|---|---|---|---|
| DEC 21164 | 32 B | 1 | 8 KB | direct-mapped | 2 C | 2 |
| | | 2* | 96 KB | 3-way assoc. | ≥ 8 C | 2 |
| | | 3* | 1-64 MB | direct-mapped | ≥ 12 C | 2 |
| IBM Power2 | 128 B / 256 B | 1 | 128 KB / 256 KB | 4-way assoc. | 8 C | 2 |
| MIPS R8000 | 16 B | 1 | 16 KB | direct-mapped | 7 C | 2 |
| | | 2* | 4-16 MB | 4-way assoc. | 50 C | 2 |

Cache configurations on some computers. *: data + instruction cache
- Current trends:
  - Large caches of several MBytes
  - Several levels of cache
157/ 627
![Page 166: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/166.jpg)
High-performance computers: general concepts
- Introduction
- Processor organization
- Memory organization
- Internal organization and performance of vector processors
- Organization of RISC processors
- Data reuse (in registers)
- Cache memory
- Data reuse (in caches)
- Virtual memory
- Data reuse (in memory)
- Processor interconnection
- The Top 500 supercomputers of June 2007
- Conclusion
158/ 627
![Page 167: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/167.jpg)
Data reuse (in caches)

Example

- Cache 10 times faster than memory, hits 90% of the time.
- What is the gain from using the cache?
- Cost of a cache miss: t_miss
- Cost of a cache hit: t_hit = 0.1 × t_miss
- Average cost: 0.9 × (0.1 × t_miss) + 0.1 × t_miss
- Gain = t_miss / (0.9 × (0.1 × t_miss) + 0.1 × t_miss) = 1 / (0.9 × 0.1 + 0.1) = 1/0.19 ≈ 5.3
  (similar to Amdahl's law)
159/ 627
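The arithmetic of this example can be checked directly (a throwaway script, with t_miss normalized to 1):

```python
# Average memory-access cost with a 90% hit ratio and a cache 10x faster.
t_miss = 1.0
t_hit = 0.1 * t_miss
avg = 0.9 * t_hit + 0.1 * t_miss   # average cost per access
gain = t_miss / avg                # speedup vs. no cache at all
print(avg, gain)                   # 0.19..., ~5.26
```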
![Page 169: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/169.jpg)
Data reuse (in caches)

It is critical to make maximum use of the data in the cache ↔ improve the cache hit ratio.

- Example: effect of the cache miss ratio on a given code
- P_max: performance when all data fit in the cache (hit ratio = 100%); T_min: the corresponding time.
- Reading a datum from the cache and executing the instruction: t_hit = 1 cycle
- Access time to a datum on a cache miss: t_miss = 10 or 20 cycles (instruction execution costs t_miss + t_hit)
- T_total = %hits × t_hit + %misses × (t_miss + t_hit)
- T_opt = 100% × t_hit
- Perf = T_opt / T_total
160/ 627
![Page 170: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/170.jpg)
| t_miss | %hits | Time hits | Time misses | T_total | Perf. |
|---|---|---|---|---|---|
| — | 100% | 1.00 | 0.00 | 1.00 | 100% |
| 10 | 99% | 0.99 | 0.11 | 1.10 | 91% |
| 20 | 99% | 0.99 | 0.22 | 1.21 | 83% |
| 10 | 95% | 0.95 | 0.55 | 1.50 | 66% |
| 20 | 95% | 0.95 | 1.10 | 2.05 | 49% |

Table: Effect of cache misses on the performance of a code (expressed as percentages vs. no cache misses).
161/ 627
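The table can be recomputed from T_total = %hits × t_hit + %misses × (t_miss + t_hit) with t_hit = 1 (a throwaway check, not part of the course; the 20-cycle rows of the slide differ by a couple of hundredths, presumably rounding on the original slide):

```python
# Recompute T_total and the relative performance 100 / T_total (t_hit = 1 cycle).
rows = []
for t_miss, hits in [(10, 0.99), (20, 0.99), (10, 0.95), (20, 0.95)]:
    t_total = hits * 1 + (1 - hits) * (t_miss + 1)
    rows.append((t_miss, hits, round(t_total, 2), round(100 / t_total)))
    print(rows[-1])
```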
![Page 171: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/171.jpg)
Efficient cache utilization: Exercise

Reuse the data held in cache as much as possible ↔ improve the cache hit ratio

- Cache: single block of CS (cache size) words
- When the cache is full: the LRU line is returned to memory
- Copy-back: memory updated only when a modified block is removed from the cache
- For simplicity, we assume a cache line size L = 1

Example from D. Gannon and F. Bodin:

    do i=1,n
      do j=1,n
        a(j) = a(j) + b(i)
      enddo
    enddo

1. Compute the cache hit ratio (assume n much larger than CS).
2. Propose a modification to improve the cache hit ratio.
162/ 627
![Page 172: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/172.jpg)
- Total number of memory references = 3 × n², i.e. n² loads of a, n² stores to a, and n² loads of b (assuming a naive compiler).
- Total number of flops = n²
- Cache empty at the beginning of the computation.
- Inner loop:

      do j=1,n
        a(j) = a(j) + b(i)
      enddo

  Each iteration reads a(j) and b(i), and writes a(j).
  For i=1 → access to a(1:n)
  For i=2 → access to a(1:n)
  Since n >> CS, a(j) is no longer in the cache when it is accessed again, therefore:
  - each read of a(j) → 1 miss
  - each write of a(j) → 1 hit
  - each read of b(i) → 1 hit (except the first one)
- Hit ratio = #hits / Mem.Refs = 2/3 ≈ 66%
163/ 627
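The 2/3 ratio can be checked by simulating the reference stream through a small LRU cache (an illustrative sketch, not from the course; n and CS are arbitrary small values with n >> CS, line size 1 as in the exercise):

```python
from collections import OrderedDict

def lru_hits(refs, CS):
    """Count hits of a reference stream in an LRU cache of CS one-word lines."""
    cache, hits = OrderedDict(), 0
    for addr in refs:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)          # refresh recency
        else:
            cache[addr] = True
            if len(cache) > CS:
                cache.popitem(last=False)    # evict least recently used
    return hits

n, CS = 100, 10                              # n >> CS
refs = []
for i in range(n):
    for j in range(n):
        # load a(j), load b(i), store a(j)
        refs += [('a', j), ('b', i), ('a', j)]
print(lru_hits(refs, CS) / len(refs))        # ~0.663, close to 2/3
```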
![Page 173: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/173.jpg)
Blocked version

The inner loop is blocked into blocks of size nb < CS so that nb elements of a can be kept in cache and entirely updated with b(1:n).

    do j=1,n,nb
      jb = min(nb,n-j+1) ! nb may not divide n
      do i=1,n
        do jj=j,j+jb-1
          a(jj) = a(jj) + b(i)
        enddo
      enddo
    enddo
164/ 627
![Page 174: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/174.jpg)
To clarify, we load the cache explicitly; it is managed as a 1D array CA(0:nb):

    do j=1,n,nb
      jb = min(nb,n-j+1)
      CA(1:jb) = a(j:j+jb-1)
      do i=1,n
        CA(0) = b(i)
        do jj=j,j+jb-1
          CA(jj-j+1) = CA(jj-j+1) + CA(0)
        enddo
      enddo
      a(j:j+jb-1) = CA(1:jb)
    enddo

Each load into the cache is a miss, each store to the cache is a hit.
165/ 627
![Page 175: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/175.jpg)
- Total memory references = 3n²
- Total misses:
  - loads of a = (n/nb) × nb = n
  - loads of b = (n/nb) × n = n²/nb
  - total = n + n²/nb
- Total hits = 3n² − n − n²/nb = (3 − 1/nb) × n² − n
- Hit ratio = hits / Mem.Refs ≈ 1 − 1/(3 nb) ≈ 100% if nb is large enough.
166/ 627
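The same LRU simulation applied to the blocked loop shows the hit ratio jumping from about 66% to above 95% (again an illustrative sketch with small n, nb, CS chosen so that nb + 1 < CS, i.e. the a block and b(i) fit in cache together):

```python
from collections import OrderedDict

def lru_hits(refs, CS):
    """Count hits of a reference stream in an LRU cache of CS one-word lines."""
    cache, hits = OrderedDict(), 0
    for addr in refs:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)
        else:
            cache[addr] = True
            if len(cache) > CS:
                cache.popitem(last=False)   # evict least recently used
    return hits

n, nb, CS = 100, 8, 10                      # illustrative sizes, nb + 1 < CS
refs = []
for j in range(0, n, nb):                   # blocked j loop
    for i in range(n):
        for jj in range(j, min(j + nb, n)):
            refs += [('a', jj), ('b', i), ('a', jj)]
ratio = lru_hits(refs, CS) / len(refs)
print(ratio)     # ~0.95: the nb elements of a stay cached across the whole i loop
```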
![Page 176: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/176.jpg)
167/ 627
![Page 177: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/177.jpg)
Virtual memory

- Real memory: code and data must reside in main memory (CRAY)
- Virtual memory: paging mechanism between memory and the disks
  Excessive memory paging can have dramatic consequences on performance!
- TLB:
  - Translation Lookaside Buffer: maps the virtual address of a page to its real address in memory
  - TLB on IBM Power4/5: 1024 entries
  - TLB miss: about 36 C
- AIX offers the possibility of increasing the page size (up to 16 MB) to limit TLB misses.
168/ 627
![Page 178: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/178.jpg)
169/ 627
![Page 179: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/179.jpg)
Exercise on data reuse (in memory)

(inspired by Dongarra, Duff, Sorensen, van der Vorst [?])
C ← C + A × B
A, B, C: n × n matrices, n = 20000, stored by columns

- Vector computer (peak performance 50 GFlop/s)
- Virtual memory (page replacement: LRU)
- 1 memory page = 2 Mwords = 100 columns of A, B, or C (1 word = 8 bytes)
- 1 page fault ≈ 10⁻⁴ seconds
- Storage of A, B, and C: 3 × 400 Mwords = 3 × 3.2 GB = 9.6 GB
- Memory capacity: 128 pages, i.e. 128 × 2 Mwords = 256 Mwords = 2 GB → A, B, C cannot be held in memory entirely
170/ 627
![Page 180: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/180.jpg)
Variant (1): ijk

    do i = 1, n
      do j = 1, n
        do k = 1, n
          Cij <- Cij + Aik * Bkj
        enddo
      enddo
    enddo

1. What are the number of page faults and the computing time of this variant (ijk)?
2. What are the number of page faults and the computing time of variant (jki)?
3. What are the number of page faults and the computing time of variant (jki) with blocking on j and k by blocks of 4 memory pages?
171/ 627
![Page 181: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/181.jpg)
Variant (1): ijk

    do i = 1, n
      do j = 1, n
        do k = 1, n
          Cij <- Cij + Aik * Bkj
        enddo
      enddo
    enddo

With sequential access to the columns of a matrix, there is 1 page fault every 100 columns.
Accessing a row of A → n/100 = 200 page faults.
Hence 200 × 20000² = 8 × 10¹⁰ page faults.
8 × 10¹⁰ page faults × 10⁻⁴ s = 8 × 10⁶ s ≈ 93 days of computation.
172/ 627
![Page 182: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/182.jpg)
Variant (2): jki

    do j = 1, n
      do k = 1, n
        do i = 1, n
          Cij <- Cij + Aik * Bkj
        enddo
      enddo
    enddo

For each j:

- all the columns of A are accessed: 200 page faults per j, i.e. n × 200 in total
- access to the columns of B and C: 200 page faults
- total ≈ 4 × 10⁶ page faults

Execution time ≈ 4 × 10⁶ × 10⁻⁴ s = 400 s
173/ 627
![Page 183: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/183.jpg)
Variant (3): blocked jki

The matrices are partitioned into blocks of columns such that one column block (nb = 400 columns) = 4 memory pages.
Maximum reuse of the submatrices held in memory.

    * Organization of the computation on submatrices
    do j = 1, n, nb
      jb = min(n-j+1,nb)
      do k = 1, n, nb              ! sectioning loops
        kb = min(n-k+1,nb)
    *   Multiplication on the submatrices:
    *   C(1:n,j:j+jb-1) <- C(1:n,j:j+jb-1)
    *                      + A(1:n,k:k+kb-1) * B(k:k+kb-1,j:j+jb-1)
        do jj = j, j+jb-1
          do kk = k, k+kb-1
            do i = 1, n
              Cijj <- Cijj + Aikk * Bkkjj
            enddo
          enddo
        enddo
      enddo
    enddo
![Page 184: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/184.jpg)
Page faults:

- nb = 400 columns (4 memory pages)
- access to B and C, page faults during the j loop: 200 page faults
- n/nb accesses (j loop) to A by column blocks, 200 for each index k, i.e. (n/nb) × 200 in total
- total ≈ (n/nb + 2) × 200 page faults
- nb = 400, hence n/nb = 50
- hence ≈ 10⁴ page faults
- memory loading time = 1 s

Beware: the computing time is no longer negligible!
Time = 2 × n³ / speed ≈ 320 seconds.
The ideas are identical to blocking for caches.
Blocking: very effective for exploiting a memory hierarchy (cache, virtual memory, ...) at its best.
175/ 627
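The counts for the three variants can be reproduced with a few lines of arithmetic (a throwaway check; B and C are counted as 200 faults each in variant jki, which the slide lumps together):

```python
# Page-fault model: n = 20000, 100 columns per page, 1e-4 s per fault,
# 50 GFlop/s peak; one pass over a matrix's columns costs n/100 = 200 faults.
n, t_fault, per_pass = 20000, 1e-4, 20000 // 100

ijk = per_pass * n * n                 # a row of A re-read for every (i, j)
jki = n * per_pass + 2 * per_pass      # A swept once per j; B, C once overall
nb = 400
blocked = (n // nb + 2) * per_pass     # A swept n/nb times; B, C once

print(ijk * t_fault, jki * t_fault, blocked * t_fault)  # 8e6 s, ~400 s, ~1 s
print(2 * n**3 / 50e9)                 # compute time: 320.0 s
```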
![Page 185: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/185.jpg)
176/ 627
![Page 186: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/186.jpg)
Processor interconnection

- Networks made of a number of switch boxes and links
- Circuit switching: a path is physically established for the whole duration of a transfer (ideal for a large transfer)
- Packet switching: packets made of data + control information find their own way through the network
- Integrated switching: allows both of the above
- Two families of networks, distinct in design and usage:
  - single-stage networks
  - multi-stage networks
177/ 627
![Page 187: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/187.jpg)
Crossbar network

[Figure: 4 × 4 crossbar of switch points between inputs 0-3 and outputs 0-3]

Any input can be connected to any output without blocking.
Theoretically the fastest of all networks, but feasible only for a small number of inputs/outputs.
Used on shared-memory computers: Alliant, Cray, Convex, ...
178/ 627
![Page 188: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/188.jpg)
Multi-stage networks

Made of more than one stage of switch boxes. A communication system allowing the largest possible number of permutations between a fixed number of inputs and outputs.
A functional unit is associated with each input (or output).
Number of inputs = number of outputs = 2^p.

Figure: Example of a multi-stage network with p = 3.

Bidirectional networks, or the network is doubled.
179/ 627
![Page 189: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/189.jpg)
Elementary switch box

Basic element in the construction of a network: a connection between two inputs and two outputs.

- Two-function box (B2F) allowing the straight and crossed connections, controlled by one bit
- Four-function box (B4F) allowing the straight, crossed, lower-broadcast and upper-broadcast connections, controlled by two bits
180/ 627
![Page 190: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/190.jpg)
- Topology: the way switch boxes are assembled to form a network with N = 2^p inputs and N outputs. Most networks are made of p stages of N/2 boxes.
- Example: Omega network
  Topology based on the "Perfect Shuffle", a permutation on vectors of 2^p elements.

[Figure: Perfect Shuffle permutation on 8 elements]

The Omega network reproduces a Perfect Shuffle at each stage. It allows one input to be distributed to all outputs ("broadcast").
181/ 627
![Page 191: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/191.jpg)
[Figure: 8 × 8 Omega network, three stages of switch boxes A-L]
182/ 627
![Page 192: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/192.jpg)
- Another possible topology: the Butterfly network (BBN, Meiko CS2)

[Figure: 8 × 8 Butterfly network, three stages of switch boxes A-L]
183/ 627
![Page 193: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/193.jpg)
Single-stage networks

- Implement a finite number of permutations between the inputs and the outputs, each of these permutations corresponding to a physical connection (in general a bidirectional channel). Generally static.

[Figure: 6 processors connected by point-to-point links]

- Widely used in local-memory architectures
184/ 627
![Page 194: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/194.jpg)
- Examples:
  - Shared bus

[Figure: processors 0..n, each with a cache and a local memory, connected by a bus to the main memory]

Widely used on SMPs (SGI, SUN, DEC, ...)
185/ 627
![Page 195: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/195.jpg)
- Ring

[Figure: processors 0..n connected in a ring]

- Grid

[Figure: processors connected in a 2D grid]

Used on Intel DELTA and PARAGON, ...
186/ 627
![Page 196: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/196.jpg)
- Shuffle Exchange: Perfect Shuffle with, in addition, Proc #i connected to Proc #(i+1)

[Figure: shuffle-exchange network on processors 0-7]

- N-cube or hypercube: Proc #i connected to Proc #j if i and j differ in exactly one bit

[Figure: 3-cube on processors 0-7]

- A great classic, used on Intel hypercubes (iPSC/1, iPSC/2, iPSC/860), NCUBE machines, CM2, ...
187/ 627
![Page 197: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/197.jpg)
Figure: 4-Cube in space.
188/ 627
![Page 198: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/198.jpg)
Usual topologies for distributed architectures

- Notation:
  - #procs = N = 2^p
  - diameter = d (critical path between 2 procs)
  - #links = w
- Ring: d = N/2, w = N
- 2D grid: d = 2 × (N^(1/2) − 1), w = 2 × N^(1/2) × (N^(1/2) − 1)
- 2D torus (grid with wrap-around links at the edges): d = N^(1/2), w = 2 × N

[Figure: 4 × 4 2D torus of processors]

- Hypercube or p-cube: d = p, w = (N × p)/2
189/ 627
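These formulas translate directly into code (a quick sketch, not from the course; square grids and tori assume N is a perfect square):

```python
import math

# Diameter d and link count w for the usual topologies, N = 2**p nodes.
def metrics(N, p):
    s = math.isqrt(N)                      # side of a square grid/torus
    return {
        'ring':      (N // 2, N),
        'grid2d':    (2 * (s - 1), 2 * s * (s - 1)),
        'torus2d':   (s, 2 * N),
        'hypercube': (p, N * p // 2),
    }

for topo, (d, w) in metrics(16, 4).items():
    print(topo, d, w)   # ring 8 16, grid2d 6 24, torus2d 4 32, hypercube 4 32
```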
![Page 199: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/199.jpg)
Remarks

- Current trend:
  - hierarchical/multi-stage networks
  - a lot of redundancy (bandwidth, simultaneous connections)
- Consequence for high-performance computers:
  - little cost difference between sources/destinations
  - the design of parallel algorithms no longer takes the network topology (rings, ...) into account
190/ 627
![Page 200: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/200.jpg)
191/ 627
![Page 201: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/201.jpg)
Top 500 statistics (see www.top500.org)

- List of the 500 most powerful machines in the world
- Metric: GFlop/s for the solution of Ax = b, with A a dense matrix
- Updated twice a year (June/ISC, November/SC)
- Over the last 10 years, performance has grown faster than Moore's law:
  - 1997:
    - #1 = 1.1 TFlop/s
    - #500 = 7.7 GFlop/s
  - 2007:
    - #1 = 280 TFlop/s
    - #500 = 4 TFlop/s
  - 2008: Roadrunner
    - #1 = 1 PFlop/s (1026 TFlop/s)
    - #500 = 4 TFlop/s
192/ 627
![Page 202: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/202.jpg)
Some general remarks (June 2007)

- IBM Blue Gene architectures dominate the top 10.
- The NEC "Earth Simulator" supercomputer (36 TFlop/s, 5120 vector processors) is now number 20. It stayed at the top from June 2002 to June 2004.
- 56 TFlop/s are needed to enter the Top 10 (vs. 15 TFlop/s in June 2005).
- Accumulated sum: 4.95 PFlop/s (vs. 1.69 PFlop/s in June 2005).
- The 500th machine (4 TFlop/s) would have ranked 216th six months earlier.
193/ 627
![Page 203: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/203.jpg)
General remarks (June 2007, continued)

- Field of activity
  - Research 25%, academia 18%, industry 53%
  - But the TOP10 is 100% research and academia.
  - France: 10/500, of which 8 in industry.
- Processors
  - 289 systems based on Intel (of which 205 on the dual-core Xeon Woodcrest)
  - 107 on AMD (of which 90: dual-core Opterons)
  - 85 on IBM Power 3, 4 or 5
  - 10 on HP PA-RISC
  - 4 on NEC (vector)
  - 3 on Sparc
  - 2 on CRAY (vector)
  - 6/500 (vs. 18/500 in 2005) based on vector processors
- Architecture
  - 107 MPP (Cray SX1, IBM SP, NEC SX, SGI Altix, Hitachi SR) vs. 393 clusters
194/ 627
![Page 204: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/204.jpg)
Site analysis - Definitions

- Rank: position in the Top 500.
- Rpeak: peak performance of the machine in floating-point operations per second.
- Rmax: maximum performance obtained on the LINPACK benchmark.
- Nmax: size of the problem used to obtain Rmax.
- Power: Watts consumed (see also www.green500.org)
  - Most/least efficient of the Top 500: 480 MFlops/Watt and 4 MFlops/Watt
  - June 2008: the Top500 #1 (437 MFlops/Watt) is 3rd on the Green500 (Top500 #2: 205 MFlops/Watt)
  - Gain of 131 MFlops/Watt compared to November 2007 (use of the Cell processor, see Section 2, Introduction)
  - A gain of only 0.4 MFlops/Watt (between June 2007 and 2008) at the bottom of the ranking
195/ 627
![Page 205: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/205.jpg)
World Top 10 (June 2007)

| Rank - Configuration | Site | #proc. | Rmax (TFlop/s) | Rpeak (TFlop/s) | Year |
|---|---|---|---|---|---|
| 1 - IBM eServer BlueGene | DOE/NNSA/LLNL | 131072 | 280 | 367 | 2005 |
| 2 - Cray XT4/XT3 (1) | Oak Ridge National Lab | 23016 | 101 | 119 | 2006 |
| 3 - Cray RedStorm (2) | NNSA/Sandia Lab | 26544 | 101 | 127 | 2006 |
| 4 - IBM eServer BlueGene | IBM TJ Watson Res. Ctr. | 40960 | 91 | 114 | 2005 |
| 5 - IBM eServer BlueGene | New York Ctr. in CS | 36864 | 82 | 103 | 2007 |
| 6 - IBM eServer pSeries (3) | DOE/NNSA/LLNL | 12208 | 75 | 92 | 2006 |
| 7 - IBM eServer Blue Gene | Nanotechnology (4) | 32768 | 73 | 91 | 2007 |
| 8 - DELL PowerEdge (5) | Nat. Ctr. Supercomp. Appl. | 10240 | 62 | 94 | 2007 |
| 9 - IBM cluster (6) | Barcelona Supercomp. Ctr. | 10240 | 62 | 94 | 2006 |
| 10 - SGI Altix 4700 1.6 GHz | Leibniz Rechenzentrum | 9728 | 56 | 62 | 2007 |
| 12 - Tera-10 NovaScale (7) | CEA | 9968 | 52 | 63 | 2006 |

(1) Opteron 2.6 GHz dual core; (2) Opteron 2.4 GHz dual core; (3) p5 1.9 GHz; (4) Rensselaer Polytech. Inst. (nanotech.); (5) 2.33 GHz, Infiniband; (6) PPC 2.3 GHz, Myrinet; (7) Itanium2 1.6 GHz, Quadrics
196/ 627
![Page 206: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/206.jpg)
World Top 7 (June 2005)

| Rank - Configuration | Site | #proc. | Rmax (TFlop/s) | Rpeak (TFlop/s) | Nmax (10³) |
|---|---|---|---|---|---|
| 1 - IBM eServer BlueGene Solution | DOE/NNSA/LLNL | 65536 | 136 | 183 | 1278 |
| 2 - IBM eServer BlueGene Solution | IBM TJ Watson Res. Ctr. | 40960 | 91 | 114 | 983 |
| 3 - SGI Altix 1.5 GHz | NASA/Ames Res. Ctr./NAS | 10160 | 51 | 60 | 1290 |
| 4 - NEC Earth Simulator | Earth Simul. Center | 5120 | 36 | 41 | 1075 |
| 5 - IBM cluster, PPC 2.2 GHz, Myri. | Barcelona Supercomp. Ctr. | 4800 | 27 | 42 | 977 |
| 6 - IBM eServer BlueGene Solution | ASTRON/Univ. Groningen | 12288 | 27 | 34 | 516 |
| 7 - NOW Itanium2 1.4 GHz, Quadrics | Los Alamos Nat. Lab. | 8192 | 19 | 22 | 975 |

Storage of a problem of size 10⁶ = 8 Terabytes
197/ 627
![Page 207: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/207.jpg)
| Vendor | Count | Percent. | ΣRmax (TFlop/s) | ΣRpeak (TFlop/s) | ΣProcs |
|---|---|---|---|---|---|
| IBM | 192 | 38.4 | 2060 | 3121 | 679128 |
| HP | 201 | 40.2 | 1193 | 1860 | 227028 |
| Dell | 22 | 4.4 | 427 | 616 | 67264 |
| Cray Inc. | 11 | 2.2 | 359 | 438 | 81070 |
| SGI | 19 | 3.8 | 281 | 317 | 48464 |
| NEC | 4 | 0.8 | 53 | 59 | 5952 |
| Self-made | 5 | 1.0 | 48 | 79 | 10056 |
| Sun | 7 | 1.4 | 43 | 59 | 5952 |
| Fujitsu | 4 | 0.8 | 25 | 47 | 7488 |
| All | 500 | 100 | 4946 | 7183 | 1221114 |

Vendor statistics of the Top 500, number of installed systems.
![Page 208: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/208.jpg)
Analysis of the French sites - June 2007

| Rank - Configuration | Site | #proc. | Rmax (GFlop/s) | Rpeak (GFlop/s) |
|---|---|---|---|---|
| 12 - NovaScale 5160 (8) | CEA | 9968 | 52840 | 63798 |
| 22 - NovaScale 3045 (9) | CEA | 6144 | 35130 | 39321 |
| 38 - IBM Blue Gene L | EDF R&D | 8192 | 18665 | 22937 |
| 110 - HP Cluster (10) | HP | 1024 | 8751 | 12288 |
| 238 - HP Cluster (11) | Food industry | 668 | 5210 | 8016 |
| 329 - HP Cluster (12) | IT Service Prov. | 640 | 4992 | 7680 |
| 349 - IBM BladeCenter (13) | Finance | 2000 | 4925 | 8800 |
| 394 - IBM Cluster (14) | PSA Peugeot | 1184 | 4673 | 6157 |
| 458-459 - IBM eServer (15) | Total SA | 1024 | 4307 | 7782 |
| 480 - HP Cluster Xeon (16) | Food industry | 688 | 4173 | 6420 |
| 489-490 - NEC SX8R (2.2 GHz) | Meteo-France | 128 | 4058 | 405 |

(8) Itanium2 1.6 GHz, Infiniband; (9) Itanium2 1.6 GHz, Quadrics; (10) Xeon 3 GHz, Infiniband; (11) Xeon 3 GHz, GigEthernet; (12) Xeon 3 GHz, GigEthernet; (13) Opteron 2.2 GHz; (14) Opteron 2.6 GHz, Infiniband; (15) pSeries 1.9 GHz, Myrinet; (16) 2.33 GHz, GigEthernet
199/ 627
![Page 209: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/209.jpg)
Analysis of the French sites - June 2005

| Rank - Configuration | Site | #proc. | Rmax (GFlop/s) | Rpeak (GFlop/s) | Nmax (10³) |
|---|---|---|---|---|---|
| 77 - HP AlphaServer SC45, 1 GHz | CEA | 2560 | 3980 | 5120 | 360 |
| 238 - HP Cluster P4 Xeon 2.4 GHz | Finance | 512 | 1831 | 3276 | |
| 251 - IBM Cluster Xeon 2.4 GHz, Gig-E | Total | 1024 | 1755 | 4915 | 335 |
| 257 - HP Cluster P4 Xeon 2.4 GHz | Caylon | 530 | 1739 | 3392 | |
| 258 - HP Cluster P4 Xeon 2.4 GHz | Caylon | 530 | 1739 | 3392 | |
| 266 - IBM Cluster Xeon 2.4 GHz, Gig-E | Soc. Gen. | 968 | 1685 | 4646 | |
| 281 - IBM eServer (1.7 GHz Power4+) | CNRS-IDRIS | 384 | 1630 | 2611 | |
| 359 - SGI Altix 1.5 GHz | CEG Gramat (armament) | 256 | 1409 | 1536 | |
| 384 - HP Superdome 875 MHz | France Telecom | 704 | 1330 | 2464 | |
| 445 - HP Cluster Xeon 3.2 GHz | Soc. Gen. | 320 | 1228 | 2048 | |
200/ 627
![Page 210: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/210.jpg)
Geographic distribution

- Africa: 1; Oceania: 5
- America: 295 (Brazil 2, Canada 10, Mexico 2, USA 281)
- Europe: 127 (Germany 24, France 13, Italy 5, UK 42, Spain 6, Russia 5)
- Asia: 72 (China 13, India 8, Japan 23, Saudi Arabia 2)
201/ 627
![Page 211: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/211.jpg)
Analysis of the platforms for academic use

- America: 44 (Canada 4, United States 39, Mexico 1)
- Oceania: 2 (Australia 1, New Zealand 1)
- Asia: 11 (Japan 8, China 1, Taiwan 1, South Korea 1)
- Europe: 33 (Germany 6, Belgium 1, Spain 3, Finland 2, France 0, Italy 1, Norway 1, Netherlands 2, United Kingdom 7, Russia 4, Sweden 4, Switzerland 1, Turkey 1)
202/ 627
![Page 212: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/212.jpg)
Processor types

[Figure: breakdown of processor types in the Top 500]

203/ 627
![Page 213: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/213.jpg)
Performance evolution

[Figure: evolution of Top 500 performance over time]

204/ 627
![Page 214: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/214.jpg)
Examples of supercomputer architectures

- Scalar machines
  - MPP IBM SP (NERSC-LBNL, IDRIS (France))
  - CRAY XT3/4 (Oak Ridge National Lab)
  - DELL cluster (NCSA)
  - Non-Uniform Memory Access (NUMA) computer SGI Altix (NASA Ames)
  - IBM Blue Gene
- Vector machines
  - NEC (Earth Simulator Center, Japan)
  - CRAY X1 (Oak Ridge Nat. Lab.)
- Machine based on the Cell processor
  - Roadrunner (Los Alamos National Lab (LANL))
205/ 627
![Page 215: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/215.jpg)
MPP IBM SP, NERSC-LBNL

[Figure: 416 nodes of 16 processors each (12 GB per node) connected by a network]

- 416 nodes of 16 processors
- 375 MHz processors (1.5 GFlop/s each); memory: 4.9 Terabytes
- 6656 processors (Rpeak = 9.9 TFlop/s)
- Previous machine (in 2000): Cray T3E (696 procs at 900 MFlop/s and 256 MB each)

Supercomputer of the Lawrence Berkeley National Lab (installed in 2001)
206/ 627
![Page 216: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/216.jpg)
MPP IBM SP CNRS-IDRIS
Supercomputer of IDRIS (installed in 2004):
I 12 nodes of 32 processors (+ X nodes of 4 procs), connected by a network
I 1.3 GHz processors (5.2 GFlops each); memory: 1.5 TB in total, 128 GB per node
I 384 processors (Rpeak = 2.61 TFlops)
207/ 627
![Page 217: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/217.jpg)
Cluster DELL ”Abe” (NCSA, Illinois)
I Performance: Rpeak = 94 TFlop/s, Rmax = 62.7 TFlop/s
I Architecture (9600 cores):
  I 1200 nodes (dual-Xeon) at 2.33 GHz
  I each Xeon has 4 cores
  I 4 flops/cycle/core (9.33 GFlop/s per core)
  I memory: 90 TB (1 GB per core)
  I Infiniband → applications
  I GigEthernet → system + monitoring
  I I/O: 170 TB at 7.5 GB/s
208/ 627
![Page 218: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/218.jpg)
Non Uniform Memory Access Computer SGI Altix
SGI Altix supercomputer (installed at NASA Ames in 2004):
I 128 C-Bricks, each with 2 nodes of 2 processors (16 GB of memory per node), connected by a network
I 512 processors: 1.5 GHz Itanium 2 (6 GFlops/proc), Rpeak = 3.1 TFlops
I 4.1 TB of globally addressable memory
I NUMA and latency: within a node 145 ns; within a C-Brick 290 ns; between C-Bricks +150 to 400 ns
2007: #10 in the Top 500 = an Altix, 63 TFlop/s, 9728 cores, 39 TB, Germany
209/ 627
![Page 219: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/219.jpg)
NEC Earth Simulator Center (characteristics)

I 640 NEC/SX6 nodes
I 5120 CPUs (8 GFlops each) → 40 TFlops
I Cost: 2 billion dollars; power: 7 MW.
210/ 627
![Page 220: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/220.jpg)
NEC Earth Simulator Center (architecture)
I 640 nodes (8 arithmetic processors each) → 40 TFlops (Rpeak: up to 16 parallel flops per AP)
I Each arithmetic processor combines a scalar unit (registers + cache) and a vector unit at 500 MHz with 8 sets of pipes (8 × 2 × 0.5 = 8 GFlops)
I Shared memory of 16 GB per node; total memory: 10 TB
I Nodes connected by a full crossbar network
NEC supercomputer (installed in Tokyo in 2002)
211/ 627
![Page 221: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/221.jpg)
Cray X1 at Oak Ridge National Lab.

I Performance: Rpeak = 6.4 TFlop/s, 2 TB of memory, Rmax = 5.9 TFlop/s
I Architecture: 504 Multi-Stream Processors (MSP):
  I 126 nodes
  I each node has 4 MSPs and 16 GB of "flat" memory
  I each MSP has 4 Single-Stream Processors (SSP)
  I each SSP has a vector unit and a superscalar unit, 3.2 GFlops in total
212/ 627
![Page 222: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/222.jpg)
Cray X1 node
213/ 627
![Page 223: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/223.jpg)
Blue Gene L (65536 dual-procs, 360 TFlops peak)
I Minimal (non-threaded) operating system
I Limited power consumption: 32 TB in total, but only 512 MB of memory per node!
I One node = 2 PowerPC at 700 MHz (2 × 2.8 GFlop/s)
I 2.8 GFlop/s or 5.6 GFlop/s peak per node
I Several fast networks, with redundancy
214/ 627
![Page 224: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/224.jpg)
Blue Gene: also efficient in MFlops/watt
215/ 627
![Page 225: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/225.jpg)
Clusters based on Cell processors

I A QS20 rack = 2 Cell processors (512 MB per processor)
I Racks connected together by GigEthernet switches
I Each Cell = 205 GFlop/s (32-bit)
I Installation at CINES (Montpellier):
  I 2 IBM QS20 racks
  I peak performance: 820 GFlop/s
  I memory: only 2 GB!
I Still very experimental and hard to program
216/ 627
![Page 226: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/226.jpg)
Entering the Petascale era: Roadrunner

I Los Alamos National Lab and IBM
I 18 clusters of 170 compute nodes
I Per node: 2 dual-core AMD Opterons and 4 IBM PowerXCell 8i processors (complete machine: 12240 PowerXCell)
I IBM PowerXCell 8i performance: 110 GFlops (64-bit floating point)
I 122400 cores and 98 TB of memory
I Rmax = 1026 TFlops; Rpeak = 1376 TFlops; 2.3 MW
217/ 627
![Page 227: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/227.jpg)
Roadrunner (continued)

I Differences between the Cell Broadband Engine (CBE) and the IBM PowerXCell 8i:
  I significant improvement in 64-bit performance (100 GFlops vs 15 GFlops)
  I faster memory
I Programming the Roadrunner:
  I 3 compilers: Opteron, PowerPC, and Cell SPE instruction sets
  I explicit management of data and programs between Opteron, PowerPC, and Cell
218/ 627
![Page 228: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/228.jpg)
Machines accessible from LIP

I National computing centers (not in the Top 500):
  I IDRIS: 1024 IBM Power4 processors, 3 NEC SX8 nodes
  I CINES: 9 nodes of 32 IBM Power4, SGI Origin 3800 (768 processors), . . .
I Regional/local machines:
  I icluster2 in Grenoble: 100 Itanium bi-processors (being renewed)
  I clusters of the Lyon federation for high-performance computing
  I Grid 5000 (node in Lyon: 127 Opteron bi-processors, 1 core/proc)
219/ 627
![Page 229: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/229.jpg)
National equipment programs

USA: Advanced Simulation and Computing Program (formerly Accelerated Strategic Computing Initiative)
I http://www.nnsa.doe.gov/asc
I Project start: 1995, DOE (Dept. of Energy)
I Objective: 1 PetaFlop/s

France: the Grid 5000 project (in addition to the CNRS computing centers IDRIS and CINES)
I http://www.grid5000.org
I Project start: 2004 (Ministry of Research)
I Objective: a network of 5000 machines over 8 distributed sites (Bordeaux, Grenoble, Lille, Lyon, Nice, Rennes, Toulouse)
220/ 627
![Page 230: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/230.jpg)
Outlook

I BlueGene/L and its successors: ≈ 3 PFlop/s in 2010
I Japanese project (10 PFlops in 2011)
I June 2008: architectures based on hybrid nodes including vector/Cell processors
221/ 627
![Page 231: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/231.jpg)
High-performance computers: general concepts
  Introduction
  Processor organization
  Memory organization
  Internal organization and performance of vector processors
  Organization of RISC processors
  Data reuse (in registers)
  Cache memory
  Data reuse (in caches)
  Virtual memory
  Data reuse (in memory)
  Processor interconnection
  The Top 500 supercomputers of June 2007
  Conclusion
222/ 627
![Page 232: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/232.jpg)
Conclusion
I Performance comes from:
  I a fast clock
  I parallelism inside the processor:
    I pipelined processing
    I overlap and chaining of functional units
  I parallelism between processors
I But:
  I data access: memory organization, communication between processors
  I hardware complexity
  I compilation techniques: pipelining / vectorization / parallelization

How can the architecture be exploited efficiently?
223/ 627
![Page 233: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/233.jpg)
Writing efficient code (I): MFLOPS or MIPS?

I MFLOPS: floating-point operations per second — does not depend on the computer
I MIPS: low-level instructions per second — depends on the computer
I Watts: efficient code on machines with a low power consumption per processor (e.g. the Cell processors)
I Computation precision: doing part of the work in reduced precision (more efficient)
224/ 627
![Page 234: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/234.jpg)
Writing efficient code (II)

I Architectural factors influencing performance:
  I memory bandwidth and latency
  I communication and synchronization costs
  I start-up time of the vector units
  I I/O requirements
I Application-dependent factors:
  I parallelism (depends on the chosen algorithms)
  I regularity of the computations
  I load balancing
  I communication volume (locality)
  I granularity — scalability
I Data locality (spatial and temporal): even more critical on Cell and GPU (Graphics Processing Unit) architectures
225/ 627
![Page 235: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/235.jpg)
Potentially efficient computations

I Proposition: let x and y be vectors and A, B, C matrices of order n; the kernel (1) x = x + αy is potentially less efficient than the kernel (2) y = A × x + y, which is potentially less efficient than the kernel (3) C = C + A × B.
I Exercise: justify the proposition above.
226/ 627
![Page 236: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/236.jpg)
I The ratio between the number of floating-point operations and the number of memory references of each kernel explains the potential:
  I x = x + αy: 3n memory references, 2n floating-point operations → Flops/Ref = 2/3
  I y = A × x + y: n² memory references, 2n² floating-point operations → Flops/Ref = 2
  I C = C + A × B: 4n² memory references, 2n³ floating-point operations → Flops/Ref = n/2
I Typically speed(3) = 5 × speed(2) and speed(2) = 3 × speed(1) . . . if optimized libraries are used!
227/ 627
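These Flops/Ref ratios can be tabulated with a short script; a minimal sketch in Python (the function name is ours, and the counts are the leading-order figures quoted above):

```python
def kernel_stats(n):
    """Leading-order flop and memory-reference counts for the three
    kernels above, and the resulting Flops/Ref ratios."""
    counts = {
        "axpy": (2 * n,    3 * n),     # x read+write, y read; mult + add per element
        "gemv": (2 * n**2, n**2),      # traffic dominated by the n^2 entries of A
        "gemm": (2 * n**3, 4 * n**2),  # A, B read; C read and written
    }
    return {name: flops / refs for name, (flops, refs) in counts.items()}
```

For n = 128, this gives 2/3 for axpy, 2 for gemv, and n/2 = 64 for gemm, illustrating why Level 3 BLAS kernels are the ones worth building algorithms around.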
![Page 237: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/237.jpg)
Limits of automatic code optimization and vectorization/parallelization

C ← α × A × B + βC (DGEMM from the BLAS)

      DO 40 j = 1, N
         ................
         DO 30 l = 1, K
            IF ( B( l, j ) .NE. ZERO ) THEN
               TEMP = ALPHA * B( l, j )
               DO 20 i = 1, M
                  C( i, j ) = C( i, j ) + TEMP * A( i, l )
   20          CONTINUE
            END IF
   30    CONTINUE
   40 CONTINUE

Most compilers parallelize the loop over j and optimize/vectorize the loop over i.
228/ 627
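For experimentation, the loop nest above can be transcribed to Python; this is only an illustrative sketch (not the reference BLAS itself), with β applied explicitly so that the update matches C ← αAB + βC:

```python
def dgemm(alpha, A, B, beta, C):
    """Python transcription of the j-l-i loop ordering of the DGEMM
    sketch above (plain lists of lists; no error checking)."""
    M, K, N = len(A), len(B), len(B[0])
    for j in range(N):
        for i in range(M):           # scale column j of C by beta first
            C[i][j] = beta * C[i][j]
        for l in range(K):
            if B[l][j] != 0.0:       # skip the column update when B(l,j) = 0
                temp = alpha * B[l][j]
                for i in range(M):   # innermost loop runs down a column of A and C
                    C[i][j] += temp * A[i][l]
    return C
```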
![Page 238: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/238.jpg)
Table: Performance (MFlops) of different GEMM versions on RISC processors, with 128 × 128 matrices.

Computer            standard   optimized   peak
DEC 3000/300 AXP        23.1        48.4    150.0
HP 715/64               16.9        38.4    128.0
IBM RS6000/750          25.2        96.1    125.0
Pentium 4              113         975     3600

I Most compiler optimizations apply to the inner loop
I In theory the potential is very good thanks to the ratio between floating-point operations and memory references (4n² memory references, 2n³ floating-point operations), i.e. n/2 — but compilers do not know how to exploit it!
229/ 627
![Page 239: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/239.jpg)
I Code optimization:
  I improve data access and exploit the spatial and temporal locality of memory references
  I loop unrolling: reduces the number of memory accesses by improving register reuse, and also allows better exploitation of the processor's internal parallelism
  I blocking for efficient cache use: improves spatial and temporal locality
  I copying data into work arrays to force locality and avoid critical strides (not always possible, sometimes too costly)
  I data prefetching
  I use of assembly code (desperate case!)
  I use of optimized libraries (ideal case!)
230/ 627
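The blocking technique mentioned above can be sketched as follows. This is an illustrative Python version (the function name and block size are ours); in practice the same tiling is written in a compiled language, where a bs × bs tile genuinely fits in cache:

```python
def blocked_matmul(A, B, C, bs=32):
    """Tiled (blocked) update C += A*B: each bs-by-bs tile of A, B and C
    is reused while it is resident, improving temporal locality."""
    n = len(A)
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # multiply tile (ii,kk) of A by tile (kk,jj) of B into tile (ii,jj) of C
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Every (i, k, j) triple is visited exactly once, so the result equals the naive triple loop; only the traversal order — and hence the cache behavior — changes.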
![Page 240: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/240.jpg)
Using an optimized library
I Optimized matrix×matrix kernels exist:
  I ATLAS — Automatically Tuned Linear Algebra Software. http://netlib.enseeiht.fr/atlas/
  I GotoBLAS, from the Univ. of Texas at Austin. http://www.cs.utexas.edu/users/flame/goto/

Figure: Performance comparison of linear algebra (BLAS) kernels (J. Dongarra)
![Page 241: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/241.jpg)
Outline
Notions and general techniques for linear algebra
  Introduction
  Gaussian Elimination
  LU Factorization
  Vector and Matrix norms
  Error, sensitivity, conditioning
  LU factorization with pivoting
  Banded systems
  Symmetric matrices
  Cholesky Factorization
  QR Factorization
  Gram-Schmidt Process
  Least-squares problems
  Eigenvalue problems
  Singular Value Decomposition (SVD)
232/ 627
![Page 242: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/242.jpg)
233/ 627
![Page 243: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/243.jpg)
Linear algebra
I Linear algebra: the branch of mathematics that deals with solutions of systems of linear equations and the related geometric notions of vector spaces and linear transformations.
I "Linear" comes from the fact that the equation

    ax + by = c

  defines a line (in two-dimensional geometry).
I This is similar to the form of a system of linear equations:

    ai1 x1 + ai2 x2 + . . . + ain xn = bi ,  i = 1, . . . , m

I A linear transformation T from a vector space V to W satisfies:
    T(v1 + v2) = T(v1) + T(v2)
    T(α v1) = α T(v1)
I Linear transformations (rotations, projections, . . . ) are often represented by matrices. For example, with

    A = [  0  1 ]
        [ -2  2 ] ,   v = [ x ]
        [  1  0 ]         [ y ] ,

  T : v → Av is a linear transformation from IR² to IR³, defined by T(x, y) = (y, −2x + 2y, x).

234/ 627
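As a quick check of this example, the transformation can be coded directly (a small illustrative snippet; the names are ours):

```python
def T(x, y):
    """Apply the 3x2 matrix A of the example to v = (x, y)."""
    A = [[0, 1], [-2, 2], [1, 0]]
    v = [x, y]
    return tuple(sum(A[i][j] * v[j] for j in range(2)) for i in range(3))
```

Evaluating T(1, 2) gives (2, 2, 1), i.e. (y, −2x + 2y, x), and additivity T(v1 + v2) = T(v1) + T(v2) can be checked on any pair of vectors.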
![Page 244: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/244.jpg)
Use of Linear algebra
Continuous problem → Discretization → Mathematicalrepresentation involving vectors and matrices
This leads to problems involving vectors and matrices, inparticular:
I systems of linear equations (sparse, dense, symmetric, unsymmetric, well conditioned, . . . )
I least-square problems
I eigenvalue problems
235/ 627
![Page 245: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/245.jpg)
I Solving Ax = b:
  I A general square: LU factorization with pivoting
  I A symmetric positive definite: Cholesky factorization LL^t, or LDL^t
  I A symmetric indefinite: LDL^t factorization
  I A rectangular m × n with m ≥ n: QR factorization
I Least-squares problems min_x ||Ax − b||2:
  I if A has full rank: Cholesky or QR factorization
  I otherwise: QR with column pivoting, or Singular Value Decomposition (SVD)
I Eigenvalue problems Ax = λx:
  I example: determining the resonance frequencies of a bridge / an aircraft
  I techniques based on orthogonal transformations: Schur decomposition, Hessenberg form, reduction to a tridiagonal matrix
I Generalized problems:
  I Ax = λBx and A^t A x = µ² B^t B x: generalized Schur and generalized SVD
I Efficient implementation is critical
236/ 627
![Page 246: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/246.jpg)
237/ 627
![Page 247: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/247.jpg)
System of linear equations ?
Example:

     2 x1 −  1 x2 +  3 x3 =  13
    −4 x1 +  6 x2 −  5 x3 = −28
     6 x1 + 13 x2 + 16 x3 =  37

can be written under the form:

    Ax = b,

with

    A = [  2  −1   3 ]        [ x1 ]        [  13 ]
        [ −4   6  −5 ] ,  x = [ x2 ] ,  b = [ −28 ]
        [  6  13  16 ]        [ x3 ]        [  37 ]

238/ 627
![Page 248: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/248.jpg)
Gaussian Elimination
Example:

    2x1 −  x2 +  3x3 = 13     (1)
   −4x1 + 6x2 −  5x3 = −28    (2)
    6x1 + 13x2 + 16x3 = 37    (3)

With 2*(1) + (2) → (2) and −3*(1) + (3) → (3) we obtain:

    2x1 −  x2 + 3x3 = 13      (4)
    0x1 + 4x2 +  x3 = −2      (5)
    0x1 + 16x2 + 7x3 = −2     (6)

Thus x1 is eliminated. With −4*(5) + (6) → (6):

    2x1 − x2 + 3x3 = 13
    0x1 + 4x2 + x3 = −2
    0x1 + 0x2 + 3x3 = 6

The linear system is then solved by backward (x3 → x2 → x1) substitution:
x3 = 6/3 = 2, x2 = (1/4)(−2 − x3) = −1, and finally x1 = (1/2)(13 − 3x3 + x2) = 3
239/ 627
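The elimination above can be automated; a minimal sketch in Python (no pivoting, so it assumes nonzero pivots, as in this example):

```python
def solve_gauss(A, b):
    """Gaussian elimination (no pivoting) followed by backward substitution.
    A and b are plain lists; copies are taken so the inputs are preserved."""
    n = len(A)
    A = [row[:] for row in A]
    b = b[:]
    for k in range(n - 1):                 # eliminate below pivot k
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # backward substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```

On the example system it reproduces the hand computation, x = (3, −1, 2).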
![Page 249: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/249.jpg)
240/ 627
![Page 250: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/250.jpg)
LU Factorization
I Find L unit lower triangular and U upper triangular such that:A = L× U
    A = [  2  −1   3 ]   [  1  0  0 ]   [ 2  −1  3 ]
        [ −4   6  −5 ] = [ −2  1  0 ] × [ 0   4  1 ]
        [  6  13  16 ]   [  3  4  1 ]   [ 0   0  3 ]

I Procedure to solve Ax = b:
  I factorize A = LU
  I solve Ly = b (forward elimination / "descente")
  I solve Ux = y (backward substitution / "remontée")
  since Ax = (LU)x = L(Ux) = Ly = b
241/ 627
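The two triangular solves of the procedure can be sketched as follows (illustrative Python; L is assumed to have a unit diagonal, as in the factorization above):

```python
def lu_solve(L, U, b):
    """Solve A x = b given A = L U: forward elimination L y = b,
    then backward substitution U x = y."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                      # descente: L has unit diagonal
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):          # remontée
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x
```

With the L and U of the example and b = (13, −28, 37), this returns x = (3, −1, 2).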
![Page 251: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/251.jpg)
From Gaussian Elimination to LU Factorization

A = A^(1), b = b^(1), and A^(1) x = b^(1) is:

    [ a11 a12 a13 ]   [ x1 ]   [ b1 ]
    [ a21 a22 a23 ] × [ x2 ] = [ b2 ]
    [ a31 a32 a33 ]   [ x3 ]   [ b3 ]

The row operations (2) ← (2) − (a21/a11) × (1) and (3) ← (3) − (a31/a11) × (1) give A^(2) x = b^(2):

    [ a11  a12      a13     ]   [ x1 ]   [ b1     ]
    [ 0    a22^(2)  a23^(2) ] × [ x2 ] = [ b2^(2) ]
    [ 0    a32^(2)  a33^(2) ]   [ x3 ]   [ b3^(2) ]

with b2^(2) = b2 − a21 b1 / a11, . . . , a32^(2) = a32 − a31 a12 / a11, . . .

Finally (3) ← (3) − (a32^(2)/a22^(2)) × (2) gives A^(3) x = b^(3):

    [ a11  a12      a13     ]   [ x1 ]   [ b1     ]
    [ 0    a22^(2)  a23^(2) ] × [ x2 ] = [ b2^(2) ]
    [ 0    0        a33^(3) ]   [ x3 ]   [ b3^(3) ]

with a33^(3) = a33^(2) − a32^(2) a23^(2) / a22^(2), . . .

Typical Gaussian elimination at step k:

    aij^(k+1) = aij^(k) − (aik^(k) / akk^(k)) akj^(k)   for i > k
    (and aij^(k+1) = aij^(k) for i ≤ k)
242/ 627
![Page 252: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/252.jpg)
From Gaussian Elimination to LU factorization
    aij^(k+1) = aij^(k) − (aik^(k) / akk^(k)) akj^(k)   for i > k
    aij^(k+1) = aij^(k)                                 for i ≤ k

I One step of Gaussian elimination can be written A^(k+1) = L^(k) A^(k) (and b^(k+1) = L^(k) b^(k)), where L^(k) is the identity matrix except for column k below the diagonal, which holds the entries −l(i,k), with

    l(i,k) = aik^(k) / akk^(k) ,   i = k+1, . . . , n

I After n − 1 steps, A^(n) = U = L^(n−1) . . . L^(1) A, which gives A = LU, with

    L = [L^(1)]^(−1) . . . [L^(n−1)]^(−1)

  the unit lower triangular matrix whose entry (i, j) below the diagonal is l(i,j).
![Page 253: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/253.jpg)
LU Factorization Algorithm
I Overwrite matrix A: we store aij^(k), k = 2, . . . , n, in A(i,j)
I In the end, A = A^(n) = U

    do k=1, n-1
      L(k,k) = 1
      do i=k+1, n
        L(i,k) = A(i,k)/A(k,k)
        do j=k, n        (better than: do j=1,n)
          A(i,j) = A(i,j) - L(i,k) * A(k,j)
        end do
      end do
    end do
    L(n,n) = 1

I Matrix A at each step: zeros are created below the diagonal, column by column, as k increases

244/ 627
![Page 254: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/254.jpg)
I Avoid building the zeros under the diagonal
I Before:

    L(n,n)=1
    do k=1, n-1
      L(k,k) = 1
      do i=k+1, n
        L(i,k) = A(i,k)/A(k,k)
        do j=k, n
          A(i,j) = A(i,j) - L(i,k) * A(k,j)

I After:

    L(n,n)=1
    do k=1, n-1
      L(k,k) = 1
      do i=k+1, n
        L(i,k) = A(i,k)/A(k,k)
        do j=k+1, n
          A(i,j) = A(i,j) - L(i,k) * A(k,j)
245/ 627
![Page 255: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/255.jpg)
I Use the lower triangle of array A to store the L(i,k) multipliers
I Before:

    L(n,n)=1
    do k=1, n-1
      L(k,k) = 1
      do i=k+1, n
        L(i,k) = A(i,k)/A(k,k)
        do j=k+1, n
          A(i,j) = A(i,j) - L(i,k) * A(k,j)

I After (the unit diagonal of L is not stored):

    do k=1, n-1
      do i=k+1, n
        A(i,k) = A(i,k)/A(k,k)
        do j=k+1, n
          A(i,j) = A(i,j) - A(i,k) * A(k,j)
246/ 627
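The final in-place version translates directly to Python (an illustrative sketch; no pivoting, and the multipliers l(i,k) end up in the strict lower triangle of A, U in the upper triangle):

```python
def lu_inplace(A):
    """In-place LU factorization without pivoting: after the call, the
    upper triangle of A holds U and the strict lower triangle holds L."""
    n = len(A)
    for k in range(n - 1):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                 # multiplier l(i,k), stored in place
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]   # update trailing submatrix
    return A
```

On the running example this yields the array [[2, −1, 3], [−2, 4, 1], [3, 4, 3]], i.e. exactly L + U − I.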
![Page 256: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/256.jpg)
I More compact array syntax (Matlab, Scilab, Fortran 90):

    do k=1, n-1
      A(k+1:n,k) = A(k+1:n,k) / A(k,k)
      A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n)
    end do

I This corresponds to a rank-1 update: at step k, the trailing submatrix A(k+1:n,k+1:n) is updated by the outer product of the L multipliers A(k+1:n,k) with the pivot row A(k,k+1:n); rows 1 to k already hold computed elements of U.
247/ 627
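The same rank-1-update formulation can be written with NumPy array slices (an illustrative sketch assuming NumPy is available; the function name is ours):

```python
import numpy as np

def lu_outer(A):
    """In-place LU without pivoting, expressed as rank-1 updates on
    array slices, mirroring the compact array syntax above."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                              # column of multipliers
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])  # rank-1 update
    return A
```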
![Page 257: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/257.jpg)
What we have computed
I we have stored the L and U factors in A:
I A(i,j), i > j corresponds to lij
I A(i,j), i ≤ j corresponds to uij
I and we had lii = 1, i = 1, n
I Finally, the array holds U in its upper triangle and L strictly below its diagonal: A = L + U − I
248/ 627
![Page 258: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/258.jpg)
Number of floating-point operations (flops)

I In the forward substitution Ly = b, the k-th unknown is computed as

    yk = bk − Σ_{j=1}^{k−1} Lkj yj

  i.e. (k−1) multiplications and (k−1) additions, for k = 1 to n, hence n² − n flops in total
I Same for the backward substitution Ux = y
I Number of flops in the Gaussian factorization, at step k = 1, 2, . . . , n−1:
  I n − k divisions
  I (n − k)² multiplications and (n − k)² additions
  I total: ≈ 2n³/3
249/ 627
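The exact count can be summed explicitly and compared with the 2n³/3 estimate (a small sketch; the function name is ours):

```python
def lu_flops(n):
    """Exact flop count of the LU factorization loop: at step k there are
    (n-k) divisions and 2*(n-k)^2 multiplications/additions."""
    return sum((n - k) + 2 * (n - k) ** 2 for k in range(1, n))
```

For n = 100 the exact count is 661650, within 1% of 2n³/3 ≈ 666667.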
![Page 259: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/259.jpg)
Exercise
    do k=1, n-1
      A(k+1:n,k) = A(k+1:n,k) / A(k,k)
      A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n)
    end do

Compute the LU factorization of

    A = [  2  −1   3 ]
        [ −4   6  −5 ]
        [  6  13  16 ]

Answer:

    A = [  1  0  0 ]   [ 2  −1  3 ]
        [ −2  1  0 ] × [ 0   4  1 ]
        [  3  4  1 ]   [ 0   0  3 ]

250/ 627
![Page 261: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/261.jpg)
Remark
I Assume that a decomposition A = LU exists, with
  I L = (lij), i,j = 1 . . . n, lower triangular with unit diagonal
  I U = (uij), i,j = 1 . . . n, upper triangular
I Computing the LU product, we have:

    aij = Σ_{k=1}^{i−1} lik ukj + uij         for i ≤ j
    aij = Σ_{k=1}^{j−1} lik ukj + lij ujj     for i > j

I Renaming i → K in the 1st equation and j → K in the 2nd:

    uKj = aKj − Σ_{k=1}^{K−1} lKk ukj                      for j = K, . . . , N
    liK = (1/uKK) (aiK − Σ_{k=1}^{K−1} lik ukK)            for i = K+1, . . . , N

I Explicit computation of uKj and liK for K = 1 to n
I In the end the same computations are performed, but in a different order (called left-looking)
251/ 627
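These two formulas yield a direct (Doolittle-style, left-looking) implementation: at step K, compute row K of U, then column K of L. A minimal Python sketch (no pivoting; the function name is ours):

```python
def lu_doolittle(A):
    """Left-looking LU from the explicit formulas above: row K of U,
    then column K of L, for K = 1 to n (no pivoting)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for K in range(n):
        L[K][K] = 1.0
        for j in range(K, n):      # u_Kj = a_Kj - sum_{k<K} l_Kk u_kj
            U[K][j] = A[K][j] - sum(L[K][k] * U[k][j] for k in range(K))
        for i in range(K + 1, n):  # l_iK = (a_iK - sum_{k<K} l_ik u_kK) / u_KK
            L[i][K] = (A[i][K] - sum(L[i][k] * U[k][K] for k in range(K))) / U[K][K]
    return L, U
```

On the running example it recovers the same L and U as the right-looking algorithm.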
![Page 262: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/262.jpg)
252/ 627
![Page 263: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/263.jpg)
Vector Norms
Definition
A vector norm is a function f : IRⁿ → IR such that

    f(x) ≥ 0 for x ∈ IRⁿ, and f(x) = 0 ⇔ x = 0
    f(x + y) ≤ f(x) + f(y) for x, y ∈ IRⁿ
    f(αx) = |α| f(x) for α ∈ IR, x ∈ IRⁿ

p-norm: ||x||p = (|x1|^p + |x2|^p + . . . + |xn|^p)^(1/p)

The most important p-norms are the 1-, 2-, and ∞-norms:

    ||x||1 = |x1| + |x2| + . . . + |xn|
    ||x||2 = (|x1|² + |x2|² + . . . + |xn|²)^(1/2) = (x^T x)^(1/2)
    ||x||∞ = max_{1≤i≤n} |xi|
253/ 627
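The p-norm definition translates to a one-liner (illustrative sketch; the function name is ours, and p = ∞ is handled as the max norm):

```python
def pnorm(x, p):
    """p-norm of a vector x, following the definition above;
    p = float('inf') gives the infinity (max) norm."""
    if p == float('inf'):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)
```

For x = (3, −4): ||x||1 = 7, ||x||2 = 5, ||x||∞ = 4.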
![Page 264: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/264.jpg)
Vector Norms – Some properties
I Cauchy-Schwarz inequality: |x^T y| ≤ ||x||2 ||y||2
  (proof based on 0 ≤ ||x − λy||² with λ = x^T y / ||y||2²)
I All norms on IRⁿ are equivalent: for all ||.||α and ||.||β, there exist c1, c2 > 0 s.t. c1 ||x||α ≤ ||x||β ≤ c2 ||x||α
I In particular:

    ||x||2 ≤ ||x||1 ≤ √n ||x||2
    ||x||∞ ≤ ||x||2 ≤ √n ||x||∞
    ||x||∞ ≤ ||x||1 ≤ n ||x||∞
254/ 627
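The three inequality chains can be spot-checked numerically on random vectors (an illustrative sketch, not a proof; the function names are ours):

```python
import math
import random

def norms(x):
    """Return (||x||_1, ||x||_2, ||x||_inf) of a list x."""
    n1 = sum(abs(v) for v in x)
    n2 = math.sqrt(sum(v * v for v in x))
    ninf = max(abs(v) for v in x)
    return n1, n2, ninf

def check_equivalence(n, trials=100, seed=0):
    """Check the three chains of inequalities above on random vectors."""
    rng = random.Random(seed)
    s = math.sqrt(n)
    tol = 1e-12  # guard against rounding in the comparisons
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        n1, n2, ninf = norms(x)
        if not (n2 <= n1 + tol and n1 <= s * n2 + tol):
            return False
        if not (ninf <= n2 + tol and n2 <= s * ninf + tol):
            return False
        if not (ninf <= n1 + tol and n1 <= n * ninf + tol):
            return False
    return True
```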
![Page 265: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/265.jpg)
Matrix Norms
I As for vector norms:

    f(A) ≥ 0 for A ∈ IR^{m×n}, and f(A) = 0 ⇔ A = 0
    f(A + B) ≤ f(A) + f(B) for A, B ∈ IR^{m×n}
    f(αA) = |α| f(A) for α ∈ IR, A ∈ IR^{m×n}

I Most matrix norms also satisfy ||AB|| ≤ ||A|| × ||B||
I Norms induced by the p-norms on vectors:

    ||A||p = max_{x ≠ 0} ||Ax||p / ||x||p = max_{||x||p = 1} ||Ax||p
    ||A||1 = max_{1≤j≤n} Σ_{i=1}^m |aij|      (maximum absolute column sum)
    ||A||∞ = max_{1≤i≤m} Σ_{j=1}^n |aij|      (maximum absolute row sum)
    ||A||p ≥ ρ(A) = max_{1≤i≤n} |λi|

I Frobenius norm:

    ||A||F = sqrt(Σ_{i=1}^m Σ_{j=1}^n |aij|²) = sqrt(Σ_i σi²) = sqrt(trace(A^T A))
255/ 627
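The 1-, ∞-, and Frobenius norms have simple closed forms and are easy to code directly (illustrative sketch; the function names are ours):

```python
def norm1(A):
    """Induced 1-norm: maximum absolute column sum."""
    m, n = len(A), len(A[0])
    return max(sum(abs(A[i][j]) for i in range(m)) for j in range(n))

def norminf(A):
    """Induced infinity-norm: maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in A)

def frobenius(A):
    """Frobenius norm: square root of the sum of squared entries."""
    return sum(v * v for row in A for v in row) ** 0.5
```

For A = [[1, −2], [3, 4]]: ||A||1 = 6, ||A||∞ = 7, ||A||F = √30.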
![Page 266: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/266.jpg)
256/ 627
![Page 267: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/267.jpg)
I Consider the linear system:

    [ .780  .563 ] × x = [ .217 ]
    [ .913  .659 ]       [ .254 ]

I Suppose that two different methods give the following results:

    x1 = [  0.341 ]    and    x2 = [  0.999 ]
         [ −0.087 ]               [ −1.001 ]

I Which solution is better?
I Residuals:

    b − A x1 = [ .000001 ]    and    b − A x2 = [ .001343 ]
               [ 0       ]                      [ .001572 ]

I x1 looks like the better solution, since it has the smaller residual
I Exact solution:

    x* = [  1 ]
         [ −1 ]

I In reality, x2 is the more accurate one.

The notion of a "good" solution is ambiguous.
257/ 627
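The comparison can be reproduced numerically; a short sketch (the vectors are the values of this classic example, and the helper name is ours):

```python
def residual(A, x, b):
    """Residual b - A x for a small dense system (plain lists)."""
    n = len(b)
    return [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]

A = [[0.780, 0.563], [0.913, 0.659]]
b = [0.217, 0.254]
x1 = [0.341, -0.087]   # tiny residual, but far from the exact solution
x2 = [0.999, -1.001]   # larger residual, but very close to x* = (1, -1)
```

Computing both residuals and both errors shows that the candidate with the smaller residual is the one with the larger error — a small residual does not imply an accurate solution when A is ill conditioned.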
![Page 268: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/268.jpg)
Sensitivity of problems

I Let A be:

    [ .780  .563 ]
    [ .913  .659 ]

  an almost singular matrix
I Let A' be:

    [ .780  .563001095 ]
    [ .913  .659       ]

  a singular matrix
→ a perturbation of the data of order O(10⁻⁶) makes the problem unsolvable
I Another issue when A is close to a singular matrix: a small change in A and/or b → large perturbations of the solution

This is not related to the solution algorithm being used.
258/ 627
![Page 269: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/269.jpg)
Machine representation of real numbers

I Reals are stored in the machine with a finite number of digits
I Normalized representation of a floating-point real:

    x = (−1)^s m × 2^e

I Most computers use base 2 (IEEE standard), but base 8 (octal), 16 (IBM), and 10 (calculators) also exist
I macheps: machine precision, i.e. the smallest positive real such that 1 + macheps > 1
I The IEEE standard defines:
  I the number formats
  I the possible rounding modes
  I exception handling (overflow, division by zero, . . . )
  I conversion procedures (to decimal, . . . )
  I the arithmetic
259/ 627
![Page 270: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/270.jpg)
I IEEE single precision:

    31 | 30       23 | 22                    0
    _________________________________________
    s  |  exponent   |  mantissa

  exponent coded on 8 bits, mantissa on 23 bits plus 1 implicit bit
I IEEE double precision:

    63 | 62       52 | 51                    0
    ________________________________________
    s  |  exponent   |  mantissa

  exponent on 11 bits, mantissa on 52 bits plus 1 implicit bit
I Single precision:
  I macheps ≈ 1.2 × 10⁻⁷
  I xmin ≈ 1.2 × 10⁻³⁸
  I xmax ≈ 3.4 × 10³⁸
I Double precision:
  I macheps ≈ 2.2 × 10⁻¹⁶
  I xmin ≈ 2.2 × 10⁻³⁰⁸
  I xmax ≈ 1.8 × 10³⁰⁸
260/ 627
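macheps can be measured directly with the classic halving loop (a standard sketch: halve eps until 1 + eps rounds to 1 in double precision):

```python
import sys

def macheps():
    """Smallest power of two eps such that fl(1 + eps) > 1
    in double precision (IEEE binary64)."""
    eps = 1.0
    while 1.0 + eps / 2.0 > 1.0:   # stop once 1 + eps/2 rounds back to 1
        eps /= 2.0
    return eps
```

The result is 2⁻⁵² ≈ 2.2 × 10⁻¹⁶, matching both the table above and `sys.float_info.epsilon`.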
![Page 271: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/271.jpg)
Special numbers
I ±∞: sign bit, mantissa = 0, maximum exponent
I NaN: sign bit, mantissa ≠ 0, maximum exponent
I ±0: sign bit, mantissa = 0, minimum exponent
I Denormalized numbers: sign bit, mantissa ≠ 0, minimum exponent
Remarks
I 0/0, √-1 → NaN
I 1/(-0) → -∞
I NaN op x → NaN
I Exceptions: overflow, underflow, divide by zero, invalid (NaN)
I Execution can either stop with an error message, or continue
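A few of these rules can be observed directly in Python (note that Python, unlike raw IEEE hardware, raises an exception on division by zero instead of returning ±∞; the behaviours below match the standard):

```python
import math

inf = math.inf
nan = inf - inf                    # invalid operation: inf - inf produces NaN
assert math.isnan(nan)
assert math.isnan(nan + 1.0)       # NaN op x -> NaN: NaN propagates
assert nan != nan                  # NaN compares unequal to everything, even itself
assert inf + 1.0 == inf            # infinity absorbs finite values

neg_zero = -0.0
assert neg_zero == 0.0                        # -0 compares equal to +0 ...
assert math.copysign(1.0, neg_zero) == -1.0   # ... but keeps its sign bit
```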
261/ 627
![Page 272: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/272.jpg)
Error analysis in floating-point arithmetic
I With the IEEE standard (model for finite-precision computation):
  fl(x op y) = (x op y)(1 + ε) with |ε| ≤ u
  - fl(x): x as represented in floating-point arithmetic
  - op = +, −, ×, /
  - u = macheps: machine precision
I Example:

  fl(x1 + x2 + x3) = fl((x1 + x2) + x3)
                   = ((x1 + x2)(1 + ε1) + x3)(1 + ε2)
                   = x1(1 + ε1)(1 + ε2) + x2(1 + ε1)(1 + ε2) + x3(1 + ε2)
                   = x1(1 + e1) + x2(1 + e2) + x3(1 + e3)

  with each |ei| ≤ 2 macheps (to first order).
I The computed sum is the exact sum of modified values xi(1 + ei), with |ei| ≤ 2u
I Backward error analysis: an algorithm is said to be backward stable if it yields the exact solution of a problem with slightly modified data (here the xi(1 + ei)).
262/ 627
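The model fl(x op y) = (x op y)(1 + ε) is easy to observe: decimal values like 0.1 are not exactly representable in base 2, so each operation picks up a small relative error, and addition is not associative. A quick Python check:

```python
u = 2.0 ** -52                    # machine precision for IEEE double
s = 0.1 + 0.2                     # fl(0.1 + 0.2)
assert s != 0.3                   # the computed sum differs from the decimal 0.3 ...
assert abs(s - 0.3) <= 4.0 * u    # ... but only by a few units of u

# Consequence of the (1 + eps) factors: addition is not associative
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)
```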
![Page 273: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/273.jpg)
Backward error analysis
I approximate solution = exact solution of a modified problem
I what size of error on the data can explain the error on the solution?
I the approximate solution is acceptable if it is the exact solution of a problem with nearby data

[Figure: data space and solution space; forward error between y = F(x) and y' = F(x'), backward error between x and x']

Conditioning
I Well-conditioned problem: ‖x − x'‖ small ⇒ ‖f(x) − f(x')‖ small
I Otherwise: sensitive, or ill-conditioned, problem
I Sensitivity, or conditioning: relative change in the solution / relative change in the data
  = |(f(x') − f(x)) / f(x)| / |(x' − x) / x|
263/ 627
![Page 274: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/274.jpg)
Error in the solution of Ax = b
I The machine representation of A (and b) is inexact: we actually solve a perturbed problem
  (A + E)x̄ = b + f
  with E = (eij), |eij| ≤ u × |aij| and |fi| ≤ u × |bi|.
  x̄: best attainable solution
I How close is x̄ to x?
I If an algorithm computes xalg and ‖x − xalg‖/‖x‖ is large, there are two possible reasons:
  - the mathematical problem is very sensitive to perturbations (and then ‖x − x̄‖ may be large too)
  - the algorithm behaves badly in finite precision
I Backward error analysis makes it possible to discriminate between these two cases (Wilkinson, 1963)
264/ 627
![Page 275: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/275.jpg)
Conditioning of a linear system

A        ↦(F)  x         such that Ax = b
A + ΔA   ↦(F)  x + Δx    such that (A + ΔA)(x + Δx) = b

Then
  ‖Δx‖ / ‖x‖ ≤ K(A) ‖ΔA‖ / ‖A‖
with K(A) = ‖A‖ ‖A⁻¹‖.
I K(A) is the condition number of the mapping F.
I If ‖ΔA‖ ≈ macheps × ‖A‖ (machine precision), then the relative error ≈ K(A) × macheps
  (for A singular: K(A) = +∞)
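For the nearly singular 2 × 2 matrix that opened this section, K(A) can be evaluated by hand in the ∞-norm; a small self-contained Python sketch:

```python
# K(A) = ||A|| * ||A^-1|| in the infinity norm for A = [.780 .563; .913 .659]
A = [[0.780, 0.563],
     [0.913, 0.659]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]      # about 1e-6: A is almost singular
Ainv = [[ A[1][1] / det, -A[0][1] / det],
        [-A[1][0] / det,  A[0][0] / det]]

def norm_inf(M):
    # infinity norm of a matrix: maximum absolute row sum
    return max(sum(abs(v) for v in row) for row in M)

K = norm_inf(A) * norm_inf(Ainv)
print(K)    # about 2.7e6: data perturbations are amplified by a factor ~10^6
```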
265/ 627
![Page 276: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/276.jpg)
Backward error of an algorithm
I Let x̄ be the computed solution. We have:

  err = min { ε > 0 such that ‖ΔA‖ ≤ ε‖A‖, ‖Δb‖ ≤ ε‖b‖, (A + ΔA)x̄ = b + Δb }
      = ‖Ax̄ − b‖ / (‖A‖‖x̄‖ + ‖b‖)

I Proof (first part): for any feasible ε,

  (A + ΔA)x̄ = b + Δb
  ⇒ b − Ax̄ = ΔA x̄ − Δb
  ⇒ ‖b − Ax̄‖ ≤ ‖ΔA‖‖x̄‖ + ‖Δb‖
  ⇒ ‖r‖ ≤ ε(‖A‖‖x̄‖ + ‖b‖)
  ⇒ ‖r‖ / (‖A‖‖x̄‖ + ‖b‖) ≤ min ε = err
266/ 627
![Page 277: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/277.jpg)
I Proof (continued): the bound is attained for

  ΔAmin = (‖A‖ / (‖x̄‖ (‖A‖‖x̄‖ + ‖b‖))) r x̄^t   and   Δbmin = −(‖b‖ / (‖A‖‖x̄‖ + ‖b‖)) r.

  We have ΔAmin x̄ − Δbmin = r, with (in the 2-norm)

  ‖ΔAmin‖ = ‖A‖‖r‖ / (‖A‖‖x̄‖ + ‖b‖)   and   ‖Δbmin‖ = ‖b‖‖r‖ / (‖A‖‖x̄‖ + ‖b‖),

  so both perturbations have relative size exactly err.
266/ 627
![Page 278: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/278.jpg)
I Furthermore, it can be shown that:

  Relative forward error ≤ Condition number × Backward error
266/ 627
![Page 279: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/279.jpg)
What to remember
I Conditioning (general case):
  κ(A, b) = ‖A⁻¹‖ (‖A‖ + ‖b‖/‖x‖)
  measures the sensitivity of the mathematical problem
I Backward error of an algorithm: ‖Ax̄ − b‖ / (‖A‖‖x̄‖ + ‖b‖)
  → measures the reliability of the algorithm
  → to be compared with the machine precision, or with the uncertainty on the data
I Error prediction:
  Forward error ≤ conditioning × backward error
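The backward error formula is cheap to evaluate; a sketch on a small 2 × 2 system (the matrix and the slightly perturbed solution are ours, for illustration):

```python
# Backward error ||A xbar - b|| / (||A|| ||xbar|| + ||b||), infinity norms
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]                    # exact solution of Ax = b is (1, 1)
xbar = [1.0001, 0.9999]           # a slightly wrong "computed" solution

r = [sum(A[i][j] * xbar[j] for j in range(2)) - b[i] for i in range(2)]
vnorm = lambda v: max(abs(t) for t in v)
normA = max(sum(abs(t) for t in row) for row in A)
backward_error = vnorm(r) / (normA * vnorm(xbar) + vnorm(b))
print(backward_error)  # about 2.5e-5: xbar exactly solves a system perturbed by that much
```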
267/ 627
![Page 280: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/280.jpg)
Notions and general techniques for linear algebra
  Introduction
  Gaussian Elimination
  LU Factorization
  Vector and Matrix norms
  Error, sensitivity, conditioning
  LU factorization with pivoting
  Banded systems
  Symmetric matrices
  Cholesky factorization
  QR factorization
  Gram-Schmidt Process
  Least squares problems
  Eigenvalue problems
  Singular value decomposition (SVD)
268/ 627
![Page 281: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/281.jpg)
Let

  A = | ε 1 | = | 1   0 | × | ε    1    |
      | 1 1 |   | 1/ε 1 |   | 0  1−1/ε  |

κ2(A) = O(1): A is well conditioned. Suppose we solve:

  | ε 1 | | x1 |   | 1 + ε |
  | 1 1 | | x2 | = |   2   |

Exact solution: x* = (1, 1).
269/ 627
![Page 282: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/282.jpg)
Varying ε we obtain:

  ε        ‖x* − x̄‖ / ‖x*‖
  10^-3    6 × 10^-6
  10^-6    2 × 10^-11
  10^-9    9 × 10^-8
  10^-12   9 × 10^-5
  10^-15   7 × 10^-2

Table: relative accuracy of the solution as a function of ε.

I So even when A is well conditioned, Gaussian elimination can introduce errors
I Explanation: the pivot ε is too small
270/ 627
![Page 283: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/283.jpg)
I Solution: swap rows 1 and 2 of A:

  | 1 1 | | x1 |   |   2   |
  | ε 1 | | x2 | = | 1 + ε |

  → perfect accuracy!
I Partial pivoting: the pivot chosen at each step is the largest element (in magnitude) of the current column
I With partial pivoting:
  1. PA = LU, where P is a permutation matrix
  2. Ly = Pb
  3. Ux = y
I LU with pivoting is backward stable:

  ‖Ax̄ − b‖ / (‖A‖ × ‖x̄‖) ≈ u         (1)
  ‖x̄ − x*‖ / ‖x*‖ ≈ u × κ(A)          (2)

  1. LU yields small residuals, independently of the conditioning of A
  2. the accuracy depends on the conditioning:
     if u ≈ 10^-q and κ∞(A) ≈ 10^p, then x̄ has approximately (q − p) correct digits
271/ 627
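Both 2 × 2 solves can be replayed in IEEE double precision; the helper below (illustrative, not from the slides) eliminates without pivoting, and the two calls differ only by the row order:

```python
def solve2x2(a11, a12, a21, a22, b1, b2):
    # Gaussian elimination WITHOUT pivoting on a 2x2 system
    m = a21 / a11                        # multiplier: huge when the pivot a11 is tiny
    x2 = (b2 - m * b1) / (a22 - m * a12)
    x1 = (b1 - a12 * x2) / a11
    return x1, x2

eps = 1e-15
# pivot eps: x1 is badly wrong, although the exact solution is (1, 1)
x1, x2 = solve2x2(eps, 1.0, 1.0, 1.0, 1.0 + eps, 2.0)
assert abs(x1 - 1.0) > 1e-3 and abs(x2 - 1.0) < 1e-6

# rows swapped (partial pivoting): pivot 1, nearly perfect accuracy
p1, p2 = solve2x2(1.0, 1.0, eps, 1.0, 2.0, 1.0 + eps)
assert abs(p1 - 1.0) < 1e-12 and abs(p2 - 1.0) < 1e-12
```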
![Page 284: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/284.jpg)
LU factorization with pivoting

do k = 1 to n-1
   find l such that |A(l,k)| = max |A(j,k)|, j = k to n
   if |A(l,k)| = 0
      exit      // A is (almost) singular
   endif
   if k != l, swap rows k and l in A (and in b)
   A(k+1:n,k) = A(k+1:n,k) / A(k,k)
   A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n)
end do
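A direct transcription of this algorithm in Python (the function name and the 3 × 3 test matrix are ours), with a check that L and U indeed reproduce the permuted rows of A:

```python
def lu_partial_pivot(A):
    # In-place LU with partial pivoting: L (unit lower) and U overwrite A.
    # Returns perm such that row i of the factored matrix came from row perm[i].
    n = len(A)
    perm = list(range(n))
    for k in range(n - 1):
        l = max(range(k, n), key=lambda j: abs(A[j][k]))  # largest pivot in column k
        if A[l][k] == 0.0:
            raise ZeroDivisionError("A is (almost) singular")
        if l != k:
            A[k], A[l] = A[l], A[k]
            perm[k], perm[l] = perm[l], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                    # multiplier, stored in place of L
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]      # rank-1 update of the trailing block
    return perm

orig = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
A = [row[:] for row in orig]
perm = lu_partial_pivot(A)
# entrywise check that L * U equals the permuted original matrix
for i in range(3):
    for j in range(3):
        s = sum((A[i][k] if k < i else (1.0 if k == i else 0.0)) *
                (A[k][j] if k <= j else 0.0) for k in range(3))
        assert abs(s - orig[perm[i]][j]) < 1e-12
```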
272/ 627
![Page 285: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/285.jpg)
273/ 627
![Page 286: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/286.jpg)
Banded systems

        | x x 0 0 0 |
        | x x x 0 0 |
    A = | 0 x x x 0 |     bandwidth = 3
        | 0 0 x x x |     A tridiagonal
        | 0 0 0 x x |

Exploiting the band structure during the factorization: L and U are bidiagonal

        | x 0 0 0 0 |         | x x 0 0 0 |
        | x x 0 0 0 |         | 0 x x 0 0 |
    L = | 0 x x 0 0 |     U = | 0 0 x x 0 |
        | 0 0 x x 0 |         | 0 0 0 x x |
        | 0 0 0 x x |         | 0 0 0 0 x |

→ the number of operations can therefore be reduced
274/ 627
![Page 287: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/287.jpg)
Banded systems
I KL: number of sub-diagonals of A
I KU: number of super-diagonals of A
I KL + KU + 1: bandwidth
Question: if p = KL = KU (total bandwidth 2p + 1), what is the operation count of the LU factorization algorithm (without pivoting)?
Answer:
  (n − p) × (p divisions + p² multiplications + p² additions) + (2/3)(p − 1)³
  ≈ 2np² flops (when n ≫ p), instead of 2n³/3.
Partial pivoting ⇒ the bandwidth grows!
I rows k and i are swapped, where A(i,k) = max(|A(j,k)|, j ≥ k)
I KL' = KL
I KU' = KL + KU
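For the tridiagonal case (KL = KU = 1), the specialized factorization-and-solve is the classical Thomas algorithm; a sketch (no pivoting, so it assumes the pivots stay nonzero):

```python
def tridiag_solve(sub, diag, sup, rhs):
    # LU factorization + solve specialized to a tridiagonal matrix:
    # O(n) flops instead of the O(n^3) of a dense factorization.
    n = len(diag)
    d, r = diag[:], rhs[:]
    for i in range(1, n):
        m = sub[i - 1] / d[i - 1]        # the single multiplier of step i
        d[i] -= m * sup[i - 1]
        r[i] -= m * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):       # back substitution on the bidiagonal U
        x[i] = (r[i] - sup[i] * x[i + 1]) / d[i]
    return x

# tridiagonal system with diagonal 2, off-diagonals 1, exact solution (1, 1, 1)
x = tridiag_solve([1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0], [3.0, 4.0, 3.0])
```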
275/ 627
![Page 290: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/290.jpg)
276/ 627
![Page 291: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/291.jpg)
Symmetric matrices
I A symmetric: only the lower or upper triangle of A is stored
I If A = LU: A^t = A ⟺ LU = U^t L^t. Hence U(L^t)^-1 = L^-1 U^t = D diagonal and U = DL^t, so
  A = L(DL^t) = LDL^t
I Example:

  | 4 -8 -4 |   |  1 0 0 |   | 4 0 0 |   | 1 -2 -1 |
  |-8 18 14 | = | -2 1 0 | × | 0 2 0 | × | 0  1  3 |
  |-4 14 25 |   | -1 3 1 |   | 0 0 3 |   | 0  0  1 |

I Solution process:
  1. A = LDL^t
  2. Ly = b
  3. Dz = y
  4. L^t x = z
I LDL^t: n³/3 flops (instead of 2n³/3 for LU)
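The example above can be reproduced with a small LDL^t routine (without pivoting, which is safe here because the matrix is positive definite; the function name is ours):

```python
def ldlt(A):
    # A = L D L^t with L unit lower triangular and D diagonal (no pivoting)
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n):
        D[j] = A[j][j] - sum(L[j][k] ** 2 * D[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * D[k] * L[j][k] for k in range(j))) / D[j]
    return L, D

L, D = ldlt([[4.0, -8.0, -4.0], [-8.0, 18.0, 14.0], [-4.0, 14.0, 25.0]])
# D = [4, 2, 3]; the strict lower part of L is [-2], [-1, 3], matching the worked example
```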
277/ 627
![Page 292: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/292.jpg)
Symmetric matrices and pivoting
I no a priori numerical stability for A → pivoting is needed
I preserving symmetry → diagonal pivoting, but this is not sufficient
I possible approaches: Aasen, Bunch & Kaufman, ...
I In general one seeks PAP^t = LDL^t, where P is a permutation matrix,
  L is lower triangular, and
  D is block diagonal with 1×1 and 2×2 blocks

         | 1 0 0 0 |   | x 0 0 0 |   | 1 0 0 0 |t
         | x 1 0 0 |   | 0 x x 0 |   | x 1 0 0 |
  PAP^t= | x 0 1 0 | × | 0 x x 0 | × | x 0 1 0 |
         | x x x 1 |   | 0 0 0 x |   | x x x 1 |
              L             D             L^t

I Examples of 2x2 pivots:

  | 0 1 |      | eps1  1   |
  | 1 0 |      |  1   eps2 |

I Pivot determination is more complex: 2 columns are examined at each step
278/ 627
![Page 293: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/293.jpg)
I Let PAP^t = [ E C^t ; C B ]. If E is a 2x2 pivot, form E^-1 to get:

  PAP^t = | I      0 |   | E   0            |   | I  E^-1 C^t |
          | CE^-1  I | × | 0   B − CE^-1C^t | × | 0  I        |

I Possible pivot selection algorithm (Bunch-Parlett):

    mu1 = max_i |a_ii|;  mu2 = max_ij |a_ij|
    if mu1 >= alpha*mu2 (for a given alpha > 0)
       choose the largest 1x1 diagonal pivot; permute s.t. |e11| = mu1
    else
       choose a 2x2 pivot s.t. |e21| = mu2

I α is chosen to minimize the growth factor, i.e. the magnitude of the entries of B − CE^-1C^t, with E 1x1 or 2x2
I 1x1 pivot (µ1 ≥ αµ2), C has one column:
  |B − C (1/µ1) C^t|_ij ≤ max_ij |B_ij| + max_ij (|c_i c_j| / µ1) ≤ µ2 + µ2²/µ1 = µ2(1 + µ2/µ1) ≤ µ2(1 + 1/α)
I 2x2 pivot: one can show that the bound is ((3 − α)/(1 − α)) µ2
I Choosing α such that (1 + 1/α)² = (3 − α)/(1 − α) (two 1x1 pivot steps vs. one 2x2 pivot step) gives α = (1 + √17)/8.
I Unfortunately, the previous algorithm requires between n³/12 and n³/6 comparisons, and is too costly.
279/ 627
![Page 294: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/294.jpg)
I More efficient variants exist, also with a good backward error
I Example: Bunch-Kaufman algorithm (1977)
  Determination of the first pivot:

    alpha <- (1 + sqrt(17))/8 ≈ 0.64
    r <- index of the largest element colmax = |a_r1| below the diagonal
    if |a11| >= alpha * colmax
       1x1 pivot a11 is ok
    else
       rowmax = |a_rp| = largest element in row r
       if rowmax * |a11| >= alpha * colmax^2
          1x1 pivot a11 is ok
       elseif |a_rr| >= alpha * rowmax
          1x1 pivot a_rr is ok, permute
       else
          2x2 pivot [ a11 a_r1 ; a_r1 a_rr ] is chosen;
          interchange rows r and 2
       endif
    endif
280/ 627
![Page 295: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/295.jpg)
281/ 627
![Page 296: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/296.jpg)
Cholesky factorization
I A is positive definite if x^t A x > 0 for all x ≠ 0
I A symmetric positive definite → Cholesky factorization A = LL^t, with L lower triangular
I By identification (2x2 case):

  | A11 A12 |   | L11  0  |   | L11 L21 |
  | A21 A22 | = | L21 L22 | × |  0  L22 |

I From this:

  A11 = L11²          → L11 = (A11)^(1/2)          (7)
  A21 = L21 × L11     → L21 = A21 / L11            (8)
  A22 = L21² + L22²   → L22 = (A22 − L21²)^(1/2)   (9)
  ...                                              (10)

I No pivoting needed: Cholesky is backward stable
I Factorization: ≈ n³/3 flops
282/ 627
![Page 297: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/297.jpg)
Cholesky-type factorization algorithm

do k = 1, n
   A(k,k) = sqrt(A(k,k))
   A(k+1:n,k) = A(k+1:n,k) / A(k,k)
   do j = k+1, n
      A(j:n,j) = A(j:n,j) - A(j:n,k) * A(j,k)
   end do
end do

I Same pattern as LU, but only the lower triangle is updated
I Compare with the LU factorization update:

  A(k+1:n,k) = A(k+1:n,k) / A(k,k)
  A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n)
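In Python the same right-looking scheme reads as follows (a sketch over full 2D lists, touching only the lower triangle; function name ours):

```python
import math

def cholesky(A):
    # Overwrite the lower triangle of the SPD matrix A with its Cholesky factor L.
    n = len(A)
    for k in range(n):
        A[k][k] = math.sqrt(A[k][k])
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                 # scale the column below the pivot
        for j in range(k + 1, n):
            for i in range(j, n):
                A[i][j] -= A[i][k] * A[j][k]   # update only the lower triangle
    return A

A = [[4.0, 0.0], [2.0, 5.0]]   # lower triangle of [[4, 2], [2, 5]]
cholesky(A)
# the lower triangle now holds L = [[2, .], [1, 2]], and [[4,2],[2,5]] = L L^t
```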
283/ 627
![Page 298: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/298.jpg)
284/ 627
![Page 299: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/299.jpg)
QR factorization
I Definition: a set of vectors x1, ..., xk is orthonormal if
  - xi^t xj = 0 for all i ≠ j
  - xi^t xi = 1
I Orthogonal matrix Q: the column vectors of Q are orthonormal; QQ^t = I, Q^-1 = Q^t
I QR factorization: A = QR, i.e. R = Q^t A, with
  - Q orthogonal
  - R upper triangular
285/ 627
![Page 300: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/300.jpg)
Example

  | 1  -8 |   | 1/3  -2/3  -2/3 |   | 3   6 |
  | 2  -1 | = | 2/3  -1/3   2/3 | × | 0  15 |  = Q × R
  | 2  14 |   | 2/3   2/3  -1/3 |   | 0   0 |
286/ 627
![Page 301: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/301.jpg)
QR factorization
I The QR factorization is generally obtained by applying successive orthogonal transformations to the data:
  Q = Q1 ... Qn
  where the Qi are simple orthogonal matrices such that Q^t A = R
I Transformations used:
  - Householder reflections
  - Givens rotations
  - Gram-Schmidt process (in which case Q is m × n and R is n × n)
287/ 627
![Page 302: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/302.jpg)
Householder reflections
H = I − 2vv^t, where v is a vector of IR^n with ‖v‖2 = 1.
H is orthogonal and symmetric.
In particular, it allows all components of a vector but one to be annihilated.
I Example:

  x = (2, −1, 2)^t,   u = x + ‖x‖2 e1 = (5, −1, 2)^t   and   v = u / ‖u‖2

  Then:

  H = I − 2v v^t = (1/15) × | -10   5  -10 |
                            |   5  14    2 |
                            | -10   2   11 |

  Hence:

  H × x = (−3, 0, 0)^t
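The worked example can be replayed numerically (helper name ours; copysign implements the choice u = x + sign(x1)‖x‖2 e1, which avoids cancellation):

```python
import math

def householder_matrix(x):
    # H = I - 2 v v^t with v = u/||u||, u = x + sign(x1)*||x||*e1:
    # H x has all components zeroed except the first.
    n = len(x)
    u = x[:]
    u[0] += math.copysign(math.sqrt(sum(t * t for t in x)), x[0])
    nu = math.sqrt(sum(t * t for t in u))
    v = [t / nu for t in u]
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] for j in range(n)]
            for i in range(n)]

x = [2.0, -1.0, 2.0]
H = householder_matrix(x)
Hx = [sum(H[i][j] * x[j] for j in range(3)) for i in range(3)]
# Hx is (-3, 0, 0) up to rounding, as in the example
```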
288/ 627
![Page 303: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/303.jpg)
Householder reflections

[Figure: vectors x, u and the reflected vector Hx, with the line Vect(u)]
289/ 627
![Page 304: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/304.jpg)
Householder reflections
Householder vector: u = x ± ‖x‖2 e1, then v = u / ‖u‖2
They allow matrices of the following form to be obtained:

      | a11 a12 a13 |
      |  0  a22 a23 |
  A = |  0  a32 a33 |
      |  0  a42 a43 |
      |  0  a52 a53 |

Let H be such that:

  H × (a22, a32, a42, a52)^t = (a'22, 0, 0, 0)^t

If H' = | 1 0 |    then H' × A = | a11 a12  a13  |
        | 0 H |                  |  0  a'22 a'23 |
                                 |  0   0   a'33 |
                                 |  0   0   a'43 |
                                 |  0   0   a'53 |
290/ 627
![Page 305: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/305.jpg)
I Triangularization of a 4 × 3 matrix: Q = H1 × H2 × H3

  | x x x |      | x x x |      | x x x |      | x x x |
  | x x x |  H1  | 0 x x |  H2  | 0 x x |  H3  | 0 x x |
  | x x x |  ->  | 0 x x |  ->  | 0 0 x |  ->  | 0 0 x |  = R
  | x x x |      | 0 x x |      | 0 0 x |      | 0 0 0 |

I QR is backward stable, with a better backward error than LU
I Number of operations ≈ (4/3) n³
291/ 627
![Page 306: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/306.jpg)
Givens rotations
I 2 × 2 rotation:

  G(θ) = |  c  s |     orthogonal,
         | -s  c |

  with c = cos(θ) and s = sin(θ).
I Use: given x = (x1, x2), take

  c = x1 / (x1² + x2²)^(1/2)   and   s = −x2 / (x1² + x2²)^(1/2);

  then y = (y1, y2) = G^t x satisfies y2 = 0
I Allows selected elements of a matrix to be annihilated
292/ 627
![Page 307: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/307.jpg)
Givens rotations
I Example: QR factorization of

      | r11 r12 r13 |
  A = |  0  r22 r23 |
      |  0   0  r33 |
      | v1  v2  v3  |

I Determine (c, s) such that:

  |  c  s |t   | r11 |   | r'11 |
  | -s  c |  × | v1  | = |  0   |
293/ 627
![Page 308: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/308.jpg)
Givens rotations
I Rotation in the (1,4) plane:

             |  c 0 0 s |
  G(1,4) =   |  0 1 0 0 |
             |  0 0 1 0 |
             | -s 0 0 c |

                  | r'11 r'12 r'13 |
  G(1,4)^t × A =  |  0   r22  r23  |
                  |  0    0   r33  |
                  |  0   v'2  v'3  |

I Successive rotations annihilate the remaining elements: (4/3) n³ flops to triangularize the matrix.
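A single rotation, applied to the vector x = (3, 4) (helper name ours):

```python
import math

def givens(x1, x2):
    # (c, s) of G = [[c, s], [-s, c]] such that G^t (x1, x2)^t = (r, 0)^t
    r = math.hypot(x1, x2)
    return x1 / r, -x2 / r

c, s = givens(3.0, 4.0)
y1 = c * 3.0 - s * 4.0    # first row of G^t is (c, -s): y1 = r = 5
y2 = s * 3.0 + c * 4.0    # second row of G^t is (s, c): y2 is annihilated
```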
294/ 627
![Page 309: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/309.jpg)
295/ 627
![Page 310: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/310.jpg)
Gram-Schmidt Process
I Hypothesis: a basis of a subspace is available
I Goal: build an orthonormal basis of that subspace
I Very useful in iterative methods, where:
  - each iterate is sought in a subspace of increasing dimension
  - one needs to maintain a basis of good quality
296/ 627
![Page 311: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/311.jpg)
Gram-Schmidt Process
Consider two linearly independent vectors x1 and x2
I q1 = x1 / ‖x1‖2 has norm 1.
I x2 − (x2, q1) q1 is orthogonal to q1:

  (x2 − (x2, q1) q1, q1) = x2^t q1 − (x2^t q1) q1^t q1 = 0

I q2 = (x2 − (x2, q1) q1) / ‖x2 − (x2, q1) q1‖2 has norm 1
297/ 627
![Page 312: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/312.jpg)
Gram-Schmidt Process

1. Compute r11 = ‖x1‖2; if r11 = 0, stop
2. q1 = x1 / r11
3. For j = 2, ..., r Do
     (q1, ..., q_{j-1} form an orthonormal basis)
4.   rij ← xj^t qi, for i = 1, 2, ..., j − 1
5.   q ← xj − Σ_{i=1}^{j-1} rij qi
6.   rjj = ‖q‖2; if rjj = 0, stop
7.   qj ← q / rjj
8. EndDo
298/ 627
![Page 313: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/313.jpg)
Remarks (on the algorithm above)
I From steps 5-7, it is clear that xj = Σ_{i=1}^{j} rij qi
I We note X = [x1, x2, ..., xr] and Q = [q1, q2, ..., qr]
I Let R be the r-by-r upper triangular matrix whose nonzeros are the ones defined by the algorithm.
I Then the above relation can be written as X = QR, where Q is n-by-r and R is r-by-r.
299/ 627
![Page 314: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/314.jpg)
Example: x1 = (1, 2, 2)^t, x2 = (−8, −1, 14)^t

I r11 = ‖x1‖2 = 3,  q1 = x1 / r11 = (1/3)(1, 2, 2)^t,
  r12 = x2^t q1 = 18/3 = 6,  and  q = x2 − r12 q1:

  q = (−8, −1, 14)^t − 6 × (1/3)(1, 2, 2)^t = (−10, −5, 10)^t,
  r22 = ‖q‖ = 15,  q2 = q / ‖q‖ = (1/3)(−2, −1, 2)^t

I Which corresponds to the factorization:

  | 1  -8 |   | 1/3  -2/3 |               | 1/3  -2/3  -2/3 |   | 3   6 |
  | 2  -1 | = | 2/3  -1/3 | × | 3   6 | = | 2/3  -1/3   2/3 | × | 0  15 |
  | 2  14 |   | 2/3   2/3 |   | 0  15 |   | 2/3   2/3  -1/3 |   | 0   0 |

QR factorization, with Q orthogonal and R upper triangular
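The same numbers come out of a direct transcription of the classical Gram-Schmidt algorithm (function name ours; the vectors are given as a list):

```python
import math

def gram_schmidt(X):
    # Classical Gram-Schmidt on the vectors in X; returns Q (orthonormal) and R.
    n = len(X)
    Q, R = [], [[0.0] * n for _ in range(n)]
    for j, x in enumerate(X):
        q = x[:]
        for i in range(j):
            R[i][j] = sum(a * b for a, b in zip(X[j], Q[i]))   # rij = xj^t qi
            q = [a - R[i][j] * b for a, b in zip(q, Q[i])]     # remove the qi component
        R[j][j] = math.sqrt(sum(a * a for a in q))
        Q.append([a / R[j][j] for a in q])
    return Q, R

Q, R = gram_schmidt([[1.0, 2.0, 2.0], [-8.0, -1.0, 14.0]])
# R = [[3, 6], [0, 15]] and Q holds q1 = (1,2,2)/3, q2 = (-2,-1,2)/3
```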
300/ 627
![Page 315: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/315.jpg)
301/ 627
![Page 316: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/316.jpg)
Least squares problems
Let A be m × n, b ∈ IR^m, m ≥ n (and most often m ≫ n)
I Problem: find x such that Ax = b
I Overdetermined system: the existence of a solution is not guaranteed, so we look for the best solution in the sense of a norm:

  min_x ‖Ax − b‖2

[Figure: b, its projection Ax onto Span(A), and the residual r]

I Main approaches: normal equations, or QR factorization
302/ 627
![Page 317: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/317.jpg)
Normal equations

  min_x ‖Ax − b‖2 ⟺ min_x ‖Ax − b‖2²

  ‖Ax − b‖2² = (Ax − b)^t (Ax − b) = x^t A^t A x − 2 x^t A^t b + b^t b

I Setting the derivative with respect to x to zero: 2A^tAx − 2A^tb = 0
  ⇒ a system of size n × n:

  A^t A x = A^t b

I A^tA is symmetric positive semi-definite, and positive definite if A has full rank (rank(A) = n)
I solution: with a Cholesky-type factorization A^tA = LDL^t
  problem: κ(A^tA) = κ(A)², and the approach is not backward stable
I (A^tA)^-1 A^t: pseudo-inverse of A
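A tiny worked instance (fitting a line through three points; the data are ours): the normal equations reduce to a 2 × 2 system, and the residual of the least-squares solution is orthogonal to the columns of A.

```python
# Least squares fit of y = x0 + x1*t through the points (0,1), (1,2), (2,2),
# via the normal equations A^t A x = A^t b (A is tiny and well conditioned here).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
b = [1.0, 2.0, 2.0]
ata = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
atb = [sum(A[k][i] * b[k] for k in range(3)) for i in range(2)]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
x = [(ata[1][1] * atb[0] - ata[0][1] * atb[1]) / det,    # Cramer on the 2x2 system
     (ata[0][0] * atb[1] - ata[1][0] * atb[0]) / det]
# the residual r = b - Ax is orthogonal to both columns of A
r = [b[k] - A[k][0] * x[0] - A[k][1] * x[1] for k in range(3)]
```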
303/ 627
![Page 318: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/318.jpg)
Solution via QR factorization

If Q is an orthogonal matrix:

  ‖Ax − b‖ = ‖Q^t(Ax − b)‖ = ‖(Q^tA)x − (Q^tb)‖

I A: m × n; Q: m × m such that A = QR

  Q^t A = R = | R1 |  } n
              | 0  |  } m − n

  R1 is upper triangular. Setting:

  Q^t b = | c |  } n
          | d |  } m − n

I we therefore have:

  ‖Ax − b‖2² = ‖Q^tAx − Q^tb‖2² = ‖R1 x − c‖2² + ‖d‖2²

I if rank(A) = rank(R1) = n, the solution is given by R1 x = c
I flop count ≈ 2n² × m
304/ 627
![Page 319: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/319.jpg)
305/ 627
![Page 320: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/320.jpg)
Eigenvalue problems
I Solve Ax = λx, where λ is an eigenvalue and x an eigenvector
I Characteristic polynomial: p(λ) = det(A − λI) (amounts to looking for λ such that A − λI is singular)
I Let T be nonsingular and Ax = λx; then

  (T^-1 A T)(T^-1 x) = λ(T^-1 x)

  A and T^-1 A T are called similar matrices: they have the same eigenvalues.
  T: similarity transformation
306/ 627
![Page 321: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/321.jpg)
Eigenvalue problems
Take T = Q, orthogonal
I A ← Q^t A Q is very attractive:
I backward stable with Householder or Givens transformations
I Q^t A Q is similar to A + E, with ‖E‖ ≈ u × ‖A‖
I We therefore look for Q such that the eigenvalues of Q^t A Q are evident
307/ 627
![Page 322: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/322.jpg)
Example
A: a 2 × 2 matrix with real eigenvalues.
One can always find (c, s) such that:

  |  c s |t   | a11 a12 |   |  c s |   | λ1  t |
  | -s c |  × | a21 a22 | × | -s c | = |  0 λ2 | = S

λ1 and λ2 are the eigenvalues of A: Schur decomposition
I If y is an eigenvector of S, then x = Qy is an eigenvector of A
I The sensitivity of an eigenvalue to perturbations is governed by how independent its eigenvector is from the eigenvectors of the other eigenvalues
308/ 627
![Page 323: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/323.jpg)
Eigenvalues: iterative methods

Power method

  v_{k+1} = A v_k / ‖A v_k‖

with v_0 chosen at random.
I converges to v such that Av = λ1 v (assuming |λ1| > |λ2| ≥ ... ≥ |λn|)
I Proof sketch:
  - if v_0 = Σ αi xi, where (xi) is a basis of eigenvectors, then
  - A^k v_0 = A^k (Σ αi xi) = Σ_{i=1}^{n} αi λi^k xi = α1 λ1^k (x1 + Σ_{i=2}^{n} (αi/α1)(λi/λ1)^k xi)
  - with (λi/λ1)^k → 0
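A sketch of the power method on a symmetric 2 × 2 matrix with eigenvalues 3 and 1 (normalizing with the ∞-norm, so the returned norm converges to |λ1|; function name ours):

```python
def power_method(A, v, iters=100):
    # Repeatedly apply A and renormalize; v aligns with the dominant eigenvector.
    for _ in range(iters):
        w = [sum(a * t for a, t in zip(row, v)) for row in A]
        nrm = max(abs(t) for t in w)       # infinity norm of A v
        v = [t / nrm for t in w]
    return nrm, v                          # nrm -> |lambda_1|

lam, v = power_method([[2.0, 1.0], [1.0, 2.0]], [1.0, 0.0])
# lam -> 3 and v -> (1, 1): the dominant eigenpair
```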
309/ 627
![Page 324: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/324.jpg)
Eigenvalues: iterative methods

Shift-and-invert

  (A − µI) v_{k+1} = v_k

I the power method applied to (A − µI)^-1
I yields the eigenvalue closest to µ
I factorize (A − µI) once (for example with LU)
I at each iteration: solve Ly = v_k, then U v_{k+1} = y
Improvements exist to accelerate convergence (Lanczos, ...).
309/ 627
![Page 325: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/325.jpg)
310/ 627
![Page 326: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/326.jpg)
Singular value decomposition (SVD)
I For A ∈ IR^{m×n}, there exist orthogonal matrices U and V such that:

  A = U Σ V^t

  the singular value decomposition.
I Remark:

  A^t A = V (Σ^t Σ) V^t   and   A A^t = U (Σ Σ^t) U^t

I U ∈ IR^{m×m} is formed of m orthonormal eigenvectors of AA^t.
I V ∈ IR^{n×n} is formed of n orthonormal eigenvectors of A^tA.
I Σ is the diagonal matrix of the singular values of A, which are the square roots of the eigenvalues of A^tA (ordered so that σ1 ≥ σ2 ≥ ... ≥ σn).
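For a 2 × 2 example the remark above can be checked by hand: the singular values are the square roots of the eigenvalues of A^tA, obtained here from the trace and determinant (the data are ours):

```python
import math

A = [[3.0, 0.0], [4.0, 5.0]]
# A^t A, a symmetric 2x2 matrix
ata = [[A[0][i] * A[0][j] + A[1][i] * A[1][j] for j in range(2)] for i in range(2)]
tr = ata[0][0] + ata[1][1]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
disc = math.sqrt(tr * tr - 4.0 * det)
# eigenvalues of A^t A are (tr +- disc)/2; singular values are their square roots
sigma = [math.sqrt((tr + disc) / 2.0), math.sqrt((tr - disc) / 2.0)]
# consistency check: sigma1 * sigma2 = |det(A)| = 15
```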
311/ 627
![Page 327: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/327.jpg)
- If A has rank r < n, then σr+1 = σr+2 = . . . = σn = 0.
- Very useful in some applications when rank(A) is not full:
  - least squares,
  - eigenvalues,
  - accurate determination of the rank of a matrix.
312/ 627
![Page 328: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/328.jpg)
Outline
Efficient dense linear algebra libraries
- Use of scientific libraries
- Level 1 BLAS and LINPACK
- BLAS
- LU Factorization
- LAPACK
- Linear algebra for distributed memory architectures
- BLACS (Basic Linear Algebra Communication Subprograms)
- PBLAS: parallel BLAS for distributed memory machines
- ScaLAPACK
- Recursive algorithms
313/ 627
![Page 329: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/329.jpg)
Efficient dense linear algebra libraries
- Use of scientific libraries
- Level 1 BLAS and LINPACK
- BLAS
- LU Factorization
- LAPACK
- Linear algebra for distributed memory architectures
- BLACS (Basic Linear Algebra Communication Subprograms)
- PBLAS: parallel BLAS for distributed memory machines
- ScaLAPACK
- Recursive algorithms
314/ 627
![Page 330: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/330.jpg)
Use of scientific libraries
(a) Robustness
(b) Efficiency
(c) Portability
(d) Usable on a wide range of applications
(a)+(b)+(c) should be true for all scientific software
- Robustness:
  - Reliability of the computations (backward stable algorithms)
  - In particular, if the input is far from an underflow/overflow threshold, the code should not produce underflow/overflow.
315/ 627
![Page 331: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/331.jpg)
- Efficiency:
  - Good performance
  - No performance degradation for large-scale problems
  - Time for execution should not vary too much for problems of identical size
- Portability:
  - Code should be written in a standard language
  - Source code can be compiled on an arbitrary machine with an arbitrary compiler; execution should be correct (robustness) and efficient
- Wide range of applications:
  - Can be used on several problems/data structures (example: matrices in the BLAS library can be dense, symmetric, packed, band)
316/ 627
![Page 332: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/332.jpg)
Use of scientific libraries and parallelism
Two main models for parallelism:
- shared address space (example: multi-processor or multi-core workstation):
  - all processors have access to the same logical memory
  - works like POSIX threads; the system maps threads to different cores/processors
  - parallelism can be transparent to the user of the library
  - standards: POSIX threads, OpenMP
- distributed memory model (example: cluster):
  - each processor has its own memory
  - each processor has a network interface
  - communication and synchronization require message passing
  - standards: PVM, MPI
317/ 627
![Page 333: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/333.jpg)
Example: dot product on 2 processors – shared memory

(dot = 0 initially)

thread 1 on proc 1:
    loc_s1 = 0
    do i = 1, n/2
      loc_s1 = loc_s1 + x(i) * y(i)
    end do
    dot = dot + loc_s1

thread 2 on proc 2:
    loc_s2 = 0
    do i = n/2+1, n
      loc_s2 = loc_s2 + x(i) * y(i)
    end do
    dot = dot + loc_s2

Result could be wrong
- problem: dot = dot + loc_s is not atomic
- possible solution: mutual exclusion with locks (critical sections)
318/ 627
![Page 335: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/335.jpg)
Example: dot product on 2 processors – shared memory

(dot = 0 initially)

thread 1 on proc 1:
    loc_s1 = 0
    do i = 1, n/2
      loc_s1 = loc_s1 + x(i) * y(i)
    end do
    lock
    dot = dot + loc_s1
    unlock

thread 2 on proc 2:
    loc_s2 = 0
    do i = n/2+1, n
      loc_s2 = loc_s2 + x(i) * y(i)
    end do
    lock
    dot = dot + loc_s2
    unlock

Result could be wrong
- problem: dot = dot + loc_s is not atomic
- possible solution: mutual exclusion with locks (critical sections)
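The same pattern can be reproduced in Python with threads, where threading.Lock plays the role of the lock/unlock pair (array contents and sizes are illustrative):

```python
# Each thread accumulates a local sum, then updates the shared total
# under a lock, as in the slide's corrected version.
import threading

x = [1.0] * 1000
y = [2.0] * 1000
dot = 0.0
lock = threading.Lock()

def partial(lo, hi):
    global dot
    s = 0.0
    for i in range(lo, hi):
        s += x[i] * y[i]
    with lock:          # critical section: dot += s is not atomic
        dot += s

t1 = threading.Thread(target=partial, args=(0, 500))
t2 = threading.Thread(target=partial, args=(500, 1000))
t1.start(); t2.start()
t1.join(); t2.join()
# dot == 2000.0
```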
318/ 627
![Page 336: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/336.jpg)
Dot product on 2 processors – Message Passing

Suppose that initially:
- p1 owns x(1:n/2) and y(1:n/2)
- p2 owns x(n/2+1:n) and y(n/2+1:n)

Processor 1:
    s_loc = dot_seq( x(1:n/2), y(1:n/2) )
    send s_loc to P2
    receive s_remote from P2
    s = s_loc + s_remote

Processor 2:
    s_loc = dot_seq( x(n/2+1:n), y(n/2+1:n) )
    send s_loc to P1
    receive s_remote from P1
    s = s_loc + s_remote

Correctness depends on send/receive protocols
- asynchronous: ok
- rendezvous protocol: deadlock
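A small Python simulation of this exchange, using queue.Queue as an asynchronous channel: both sides send before receiving, which is safe precisely because put() does not block. Under rendezvous semantics the two sends would wait for each other and deadlock. Names and data are illustrative:

```python
# Two "processors" (threads) exchange partial dot products through
# asynchronous channels; both end up with the full result.
import threading
import queue

x = list(range(8))
y = [1.0] * 8
to_p1, to_p2 = queue.Queue(), queue.Queue()
results = {}

def proc(name, lo, hi, inbox, outbox):
    s_loc = sum(x[i] * y[i] for i in range(lo, hi))
    outbox.put(s_loc)        # asynchronous send: does not block
    s_remote = inbox.get()   # receive the other partial sum
    results[name] = s_loc + s_remote

p1 = threading.Thread(target=proc, args=("p1", 0, 4, to_p1, to_p2))
p2 = threading.Thread(target=proc, args=("p2", 4, 8, to_p2, to_p1))
p1.start(); p2.start()
p1.join(); p2.join()
# results["p1"] == results["p2"] == 28.0  (0+1+...+7)
```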
319/ 627
![Page 338: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/338.jpg)
Calling parallel libraries

shared memory parallelism
- Parallelism (threads) can be created inside the call
  - threads
  - OpenMP standard
- Can be transparent for the user

distributed memory parallelism
- Each processor executes a program
- Each processor calls the library function (SPMD)
- Data distribution must be specified in the API:
  - data replicated on the processors (large memory usage)
  - data only on one (master) processor initially (bottleneck on master)
  - chunks of data on each processor

There exist other parallel programming models (SIMD or data parallel, BSP, mixed shared-distributed programming, ...)
320/ 627
![Page 339: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/339.jpg)
Parallelism and portable libraries

- Historically, each parallel machine was unique, along with its programming model and programming language
- For each new type of machine, development had to start again
- Now the programming model is distinguished from the underlying machine, so portable, correct code can be written
  - shared memory: OpenMP directives on top of threads (loop parallelism, ...)
  - distributed memory: MPI is the most portable
321/ 627
![Page 340: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/340.jpg)
Efficient dense linear algebra libraries
- Use of scientific libraries
- Level 1 BLAS and LINPACK
- BLAS
- LU Factorization
- LAPACK
- Linear algebra for distributed memory architectures
- BLACS (Basic Linear Algebra Communication Subprograms)
- PBLAS: parallel BLAS for distributed memory machines
- ScaLAPACK
- Recursive algorithms
322/ 627
![Page 341: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/341.jpg)
Level 1 BLAS and LINPACK

- First effort to define a standard for:
  - basic vector operations used in linear algebra (BLAS, later called BLAS 1)
  - a portable package to solve systems of linear equations (LINPACK)
- The LINPACK/BLAS1 standard was defined in 1979
- Goals/motivations:
  - ease the design of numerical codes
  - better readability
  - efficiency: optimized or assembler versions
  - improved robustness, reliability and portability (standardization)
- Level 1 BLAS used in LINPACK
323/ 627
![Page 342: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/342.jpg)
Linpack performance (MFlops)

Computer                    Peak perf   Effective perf   Efficiency
ALLIANT FX/2800 (14 proc)   560         31               0.06
CONVEX C-210                50          17               0.34
CONVEX C-3810 (1 proc)      106         37               0.35
CONVEX C-240 (4 proc)       126         27               0.21
CRAY-XMP-1                  235         70               0.28
CRAY-XMP-4 (4 proc)         940         178              0.22
CRAY-2 (4 proc)             1951        129              0.066
CRAY-YMP-1                  333         161              0.48
CRAY-YMP-8 (8 proc)         2664        275              0.10
CRAY C-90 (1 proc)          1000        326              0.33
FUJITSU VP 2600/10          5000        249              0.05
HITACHI S-820/80            3000        107              0.036
IBM RS/6000-530             50          13               0.26
IBM RS/6000-550             83          27               0.34
NEC SX-2                    1300        43               0.033
NEC SX-3                    5500        314              0.06
324/ 627
![Page 343: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/343.jpg)
Why is performance so low?

- Memory contention
  The main kernel used in LINPACK is
  SAXPY: y ← y + α × x
  SAXPY: 2 loads, a multiplication, an addition, and a store
  Ratio flops/memory references = 2/3
  This does not allow an efficient use of the memory hierarchy (data are not reused)
- Objective
  Increase the flops/memory references ratio
- How?
  Re-use data that sit in scalar/vector registers or in low-level cache several times
  → definition of higher-level BLAS (matrix operations)
325/ 627
![Page 344: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/344.jpg)
Efficient dense linear algebra librariesUse of scientific librariesLevel 1 BLAS and LINPACKBLASLU FactorizationLAPACKLinear algebra for distributed memory architecturesBLACS (Basic Linear Algebra Communication Subprograms)PBLAS : parallel BLAS for distributed memory machinesScaLAPACKRecursive algorithms
326/ 627
![Page 345: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/345.jpg)
BLAS library
BLAS: Basic Linear Algebra Subprograms. Three levels:

- BLAS1: vector-vector operations - complexity O(n)
- BLAS2: matrix-vector operations - complexity O(n²)
- BLAS3: matrix-matrix operations - complexity O(n³)
              typical operation      # flops   memory accesses   ratio
BLAS1 (1979)  y = αx + y             2n        3n + 1            2/3
BLAS2 (1988)  y = αAx + βy           2n²       n² + 3n           2
BLAS3 (1990)  C = αAB + βC           2n³       4n²               n/2
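The ratios in the table can be checked for a concrete n with a few lines of Python (function name and the sample size are illustrative):

```python
# Flop-to-memory-reference ratios of the three BLAS levels for size n,
# matching the operation counts in the table.
def blas_ratios(n):
    return {
        "BLAS1": (2 * n) / (3 * n + 1),            # y = a*x + y
        "BLAS2": (2 * n ** 2) / (n ** 2 + 3 * n),  # y = a*A*x + b*y
        "BLAS3": (2 * n ** 3) / (4 * n ** 2),      # C = a*A*B + b*C
    }

r = blas_ratios(1000)
# BLAS1 -> ~2/3, BLAS2 -> ~2, BLAS3 -> n/2 = 500.0
```

Only BLAS3 has a ratio that grows with n, which is why blocked matrix-matrix kernels can hide the memory hierarchy.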
327/ 627
![Page 346: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/346.jpg)
BLAS performance
328/ 627
![Page 347: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/347.jpg)
BLAS Benefits

The BLAS offer several benefits:

1. Robustness:
   low-level details (e.g., the treatment of exceptions such as overflow) are handled by the library.
2. Portability/Efficiency:
   thanks to the standardization of the API, machine-dependent optimizations are left to the vendors/system administrators. Nowadays available on all scientific computers.
3. Readability:
   modular description of the mathematical algorithms (Matlab-like).

The subroutines are available for the four standard arithmetics:

1. single real: prefix S,
2. double real: prefix D,
3. single complex: prefix C,
4. double complex: prefix Z.
![Page 348: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/348.jpg)
BLAS 1: quick overview
scal: x = αx          axpy: y = αx + y
swap: y ↔ x           copy: y = x
dot:  dot = x^T y     nrm2: nrm2 = ‖x‖2
min/max search, generating and applying plane rotations

call DAXPY(N, ALPHA, X, INCX, Y, INCY)
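A minimal pure-Python model of what DAXPY computes; the real routine works on strided Fortran arrays, and this sketch only mirrors its semantics:

```python
# Model of DAXPY(N, ALPHA, X, INCX, Y, INCY): y <- alpha*x + y on
# n elements taken with strides incx and incy.
def daxpy(n, alpha, x, incx, y, incy):
    ix = iy = 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy
    return y

y = daxpy(4, 2.0, [1.0, 2.0, 3.0, 4.0], 1, [0.0, 0.0, 0.0, 0.0], 1)
# y == [2.0, 4.0, 6.0, 8.0]
```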
330/ 627
![Page 349: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/349.jpg)
BLAS 2: quick overview
α, β are scalars, x, y are vectors, A is a general matrix, T is a triangular matrix and H is a Hermitian matrix.

- Matrix-vector products

  y = αAx + βy    y = αA^T x + βy    y = αA^H x + βy
  x = Tx          x = T^T x          x = T^H x

- Rank-one and rank-two updates

  A = αxy^T + A    A = αxy^T + αyx^T + A
  H = αxx^H + H    H = αxy^H + αyx^H + H

- Solution of triangular systems

  x = T^{-1} x    x = T^{-T} x    x = T^{-H} x
call DGEMV(TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY)
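A pure-Python model of the operation performed by DGEMV with TRANS='N'. The Fortran routine stores A column-major with leading dimension LDA; row lists are used here only for clarity:

```python
# y <- alpha*A*x + beta*y for a dense m-by-n matrix A.
def gemv(alpha, A, x, beta, y):
    m, n = len(A), len(A[0])
    return [beta * y[i] + alpha * sum(A[i][j] * x[j] for j in range(n))
            for i in range(m)]

y = gemv(1.0, [[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 0.0, [0.0, 0.0])
# y == [3.0, 7.0]
```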
![Page 350: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/350.jpg)
BLAS 2: naming scheme
1. first character: data type (S, D, C, Z)
2. characters 2 and 3: matrix type
   - GE: general matrix
   - GB: general band matrix
   - HE: Hermitian matrix
   - SY: symmetric matrix
   - SP: symmetric matrix in "packed" format
   - HP: Hermitian matrix in "packed" format
   - HB: Hermitian band matrix
   - SB: symmetric band matrix
   - TR: triangular matrix
   - TP: triangular matrix in "packed" format
   - TB: triangular band matrix
3. characters 4 and 5: operation type
   - MV: matrix-vector product y = αAx + y
   - R: rank-one update A = A + αxy^T
   - R2: rank-two update A = A + αxy^T + αyx^T
   - SV: triangular system solution x = T^{-1} x
![Page 351: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/351.jpg)
BLAS 3: quick overview

A, B, C are general matrices and T is a triangular matrix.

- Matrix-matrix product

  C = αAB + βC        C = αA^T B + βC
  C = αAB^T + βC      C = αA^T B^T + βC

- Rank-k and rank-2k updates of a symmetric matrix

  C = αAA^T + βC      C = αA^T A + βC
  C = αA^T B + αB^T A + βC      C = αAB^T + αBA^T + βC

- Multiplying a matrix by a triangular matrix

  B = αTB     B = αT^T B
  B = αBT     B = αBT^T

- Solving triangular systems with multiple right-hand sides

  B = αT^{-1} B    B = αT^{-T} B
  B = αBT^{-1}     B = αBT^{-T}
call DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
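A pure-Python model of C ← αAB + βC as computed by DGEMM with TRANSA=TRANSB='N' (A is m-by-k, B is k-by-n; row lists stand in for the Fortran column-major storage):

```python
# C <- alpha*A*B + beta*C for dense matrices.
def gemm(alpha, A, B, beta, C):
    m, k, n = len(A), len(B), len(B[0])
    return [[beta * C[i][j] + alpha * sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)] for i in range(m)]

C = gemm(1.0, [[1.0, 2.0]], [[3.0], [4.0]], 0.0, [[0.0]])
# C == [[11.0]]
```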
![Page 352: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/352.jpg)
BLAS 3: naming scheme
1. first character: data type (S, D, C, Z)
2. characters 2 and 3 : matrix type
   - GE: general matrix
   - HE: Hermitian matrix
   - SY: symmetric matrix
   - TR: triangular matrix
3. characters 4 and 5 : operation type
   - MM: matrix-matrix product C = αAB + βC
   - RK: rank-k update of a symmetric or Hermitian matrix C = αAA^T + βC
   - R2K: rank-2k update of a symmetric or Hermitian matrix C = αAB^T + αBA^T + βC
   - SM: solution of a triangular system with multiple right-hand sides B = T^{-1} B
![Page 353: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/353.jpg)
Performance of the BLAS

- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
- Routines have a large design space with many parameters: blocking sizes, loop nesting permutations, loop unrolling depths, ...
- Complicated interactions with the increasingly sophisticated micro-architectures of modern microprocessors.
- Need for quick/dynamic deployment of optimized routines:
  - ATLAS - Automatically Tuned Linear Algebra Software
  - PHiPAC from Berkeley
- More recent approach:
  - FLAME/Goto BLAS from Univ. Texas at Austin
  - Main idea: minimize TLB misses
335/ 627
![Page 354: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/354.jpg)
Optimized BLAS
Peak performance of the Power 4 : 3.5 GFlops
![Page 355: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/355.jpg)
Optimized BLAS
Peak performance of the Itanium 2 : 3.7 GFlops
![Page 356: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/356.jpg)
ATLAS performance
J. Dongarra figures
![Page 357: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/357.jpg)
Conclusion on BLAS

- standard ⇒ used to design portable and efficient codes
- optimized BLAS libraries available
  Never code vector/matrix operations yourself; always rely on optimized kernels
- parallel BLAS kernels (exploit parallelism inside BLAS routines)
  - Pros: portability + parallelism is hidden from the user
  - Cons: not always the most efficient way to parallelize an application
  - frequent on multicores/multiprocessors with shared memory
  - rare on multiprocessors with virtual shared memory
  - distributed memory: PBLAS and message passing
339/ 627
![Page 358: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/358.jpg)
Parallel BLAS performance

- Has existed for a long time

Computer     Prec      1 proc. |  1      2      4      8      16     24  procs
BBN TC2000   32 bits   7.8     |  6.6    13.4   26.2   52.1   98.8   124.4
             64 bits   2.7     |  2.5    4.9    9.7    19.2   37.2   47.0
KSR1         64 bits   27.5    |  25.4   42.9   81.9   165.4  305.4  418.3

Table: Performance in MFlops of GEMM using square matrices of order 512 on BBN TC2000 and KSR1.

- Several BLAS libraries (ATLAS, GOTO BLAS, vendors' BLAS) provide threaded parallelism that can be efficiently exploited on SMP or multi-core architectures
340/ 627
![Page 359: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/359.jpg)
Efficient dense linear algebra libraries
- Use of scientific libraries
- Level 1 BLAS and LINPACK
- BLAS
- LU Factorization
- LAPACK
- Linear algebra for distributed memory architectures
- BLACS (Basic Linear Algebra Communication Subprograms)
- PBLAS: parallel BLAS for distributed memory machines
- ScaLAPACK
- Recursive algorithms
341/ 627
![Page 360: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/360.jpg)
LU Factorization

- Solution of Ax = b:
  - factorization PA = LU with P a permutation matrix
  - forward-backward substitution:
    - Ly = Pb
    - Ux = y
- Gaussian elimination:

    do ...
      do ...
        do ...
          a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
        end do
      end do
    end do

Order of the nested loops → 6 alternatives, 3 of them column-oriented
342/ 627
![Page 361: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/361.jpg)
Column-oriented variants

- KJI - SAXPY (right-looking)

    do k=1, n-1
      do j=k+1, n
        do i=k+1, n
          a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
        end do
      end do
    end do

Note: divisions of the columns by the pivot are omitted.
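The KJI loop can be made runnable in Python. This hedged sketch differs slightly from the slide in that it performs the (otherwise omitted) division by the pivot and stores the multipliers in place, so both L and U are recovered; no pivoting is done:

```python
# Right-looking (KJI) LU factorization without pivoting, in place:
# after the loop, a holds U on and above the diagonal and the
# multipliers of L (unit diagonal implied) strictly below it.
def lu_kji(A):
    n = len(A)
    a = [row[:] for row in A]
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]          # column k of L
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a

a = lu_kji([[4.0, 3.0], [6.0, 3.0]])
# L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
```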
343/ 627
![Page 362: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/362.jpg)
Column-oriented variants

- JKI - GAXPY (left-looking)

    do j=2, n
      do k=1, j-1
        do i=k+1, n
          a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
        end do
      end do
    end do

Note: divisions of the columns by the pivot are omitted.
344/ 627
![Page 363: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/363.jpg)
Column-oriented variants

- JIK - SDOT

    do j=2, n
      do i=2, n
        do k=1, min(i,j)-1
          a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
        end do
      end do
    end do

Note: divisions of the columns by the pivot are omitted.
345/ 627
![Page 364: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/364.jpg)
LU factorization: Crout variant

At each step K, build the Kth row and the Kth column.

    do K=2, n
      ! build row K (i=K)
      i=K
      do k=1, K-1
        do j=K, n
          a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
        end do
      end do
      ! build column K (j=K)
      j=K
      do k=1, K-1
        do i=K+1, n
          a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
        end do
      end do
    end do
346/ 627
![Page 365: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/365.jpg)
Blocked algorithms

(Key to get high performance)

    ( A11 A12 A13 )   ( L11         ) ( U11 U12 U13 )
    ( A21 A22 A23 ) = ( L21 L22     ) (     U22 U23 )
    ( A31 A32 A33 )   ( L31 L32 L33 ) (         U33 )

That is equivalent (equating terms with A) to:

A11 = L11 U11,   A12 = L11 U12,             A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,   A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,   A33 = L31 U13 + L32 U23 + L33 U33.
347/ 627
![Page 366: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/366.jpg)
Various variants

A11 = L11 U11,   A12 = L11 U12,             A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,   A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,   A33 = L31 U13 + L32 U23 + L33 U33.

Postponing some updates and changing the order in which they are computed leads to different variants:

Left looking    Right looking    Crout (i,j,k variants)
348/ 627
![Page 367: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/367.jpg)
Left looking LU

A11 = L11 U11,   A12 = L11 U12,             A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,   A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,   A33 = L31 U13 + L32 U23 + L33 U33.

Step 1: factor the first block column (the trailing block columns are left untouched):

    [ (L11; L21; L31), U11 ] = LU( (A11; A21; A31) )
![Page 368: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/368.jpg)
Left looking LU

A11 = L11 U11,   A12 = L11 U12,             A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,   A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,   A33 = L31 U13 + L32 U23 + L33 U33.

Step 2: update and factor the second block column (third block column untouched):

    U12 = L11^{-1} A12
    [ (L22; L32), U22 ] = LU( (A22; A32) - (L21; L31) U12 )
![Page 369: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/369.jpg)
Left looking LU

A11 = L11 U11,   A12 = L11 U12,             A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,   A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,   A33 = L31 U13 + L32 U23 + L33 U33.

Step 3:

    (U13; U23) = (L11 0; L21 L22)^{-1} (A13; A23)
    [ L33, U33 ] = LU( A33 - (L31 L32) (U13; U23) )
![Page 370: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/370.jpg)
Right-looking LU

A11 = L11 U11,   A12 = L11 U12,             A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,   A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,   A33 = L31 U13 + L32 U23 + L33 U33.

Step 1:

    [ L11, U11 ] = LU( A11 )
    (U12 U13) = L11^{-1} (A12 A13)
    (L21; L31) = (A21; A31) U11^{-1}
    ( A22^(1) A23^(1) ; A32^(1) A33^(1) ) = ( A22 A23 ; A32 A33 ) - (L21; L31) (U12 U13)
![Page 371: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/371.jpg)
Right-looking LU

Step 2:

    [ L22, U22 ] = LU( A22^(1) )
    U23 = L22^{-1} A23^(1)
    L32 = A32^(1) U22^{-1}
    A33^(2) = A33^(1) - L32 U23

Step 3:

    [ L33, U33 ] = LU( A33^(2) )
351/ 627
![Page 372: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/372.jpg)
Crout LU

A11 = L11 U11,   A12 = L11 U12,             A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,   A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,   A33 = L31 U13 + L32 U23 + L33 U33.

Step 1:

    [ L11, U11 ] = LU( A11 )
    (U12 U13) = L11^{-1} (A12 A13)
    (L21; L31) = (A21; A31) U11^{-1}

(The trailing blocks A22, A23, A32, A33 are not yet updated.)
![Page 373: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/373.jpg)
Crout LU

Step 2:

    ( A22^(1/2) A23^(1/2) ) = ( A22 A23 ) - L21 ( U12 U13 )
    A32^(1/2) = A32 - L31 U12          (A33 not yet updated)

    [ (L22; L32), U22 ] = LU( (A22^(1/2); A32^(1/2)) )
    U23 = L22^{-1} A23^(1/2)
353/ 627
![Page 374: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/374.jpg)
Crout LU

Step 3:

    A33^(1/2) = A33 - (L31 L32) (U13; U23)
    [ L33, U33 ] = LU( A33^(1/2) )
354/ 627
![Page 375: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/375.jpg)
Performance of blocked algorithms

n           100      500    1000    1500
F77 loops   0.0240   2.87   30.19   105.82
BLAS 1      0.0057   0.40   11.97   44.81
BLAS 3      0.0021   0.18   1.42    4.68

Elapsed time (seconds) of Cholesky factorization on SGI O2K.

Note the n³ growth for the Fortran loops (Cholesky complexity: n³/3 flops).
355/ 627
![Page 376: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/376.jpg)
Use of BLAS 3 in LU factorization

All these block algorithms can be expressed using BLAS 3 kernels.
Example: LU right-looking (KJI-SAXPY), with panel Bk, row block Uk and trailing submatrix Ck. At each step:

1. Unblocked (pivoting) factorization of Bk
2. Compute row block Uk: TRSM
3. Update submatrix Ck: GEMM

- All variants → same number of flops
- Different memory accesses
- Efficiency depends on the relative BLAS 3 efficiency
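A hedged pure-Python sketch of this blocked right-looking scheme, with no pivoting and illustrative names: step 1 is the unblocked panel factorization, step 2 a TRSM-like unit-lower triangular solve for the row block, and step 3 a GEMM-like update of the trailing submatrix:

```python
# Blocked right-looking LU without pivoting; a ends up holding U on
# and above the diagonal and the multipliers of L (unit diagonal)
# strictly below it. nb is the (illustrative) block size.
def lu_blocked(A, nb):
    n = len(A)
    a = [row[:] for row in A]
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # 1. unblocked factorization of the panel a[k0:n, k0:k1]
        for k in range(k0, k1):
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # 2. TRSM-like solve: row block Uk = L11^{-1} a[k0:k1, k1:n]
        for j in range(k1, n):
            for k in range(k0, k1):
                for i in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # 3. GEMM-like update of the trailing submatrix Ck
        for i in range(k1, n):
            for j in range(k1, n):
                for k in range(k0, k1):
                    a[i][j] -= a[i][k] * a[k][j]
    return a

A0 = [[10.0, 2.0, 3.0, 1.0],
      [4.0, 10.0, 1.0, 2.0],
      [2.0, 1.0, 10.0, 3.0],
      [1.0, 2.0, 3.0, 10.0]]
a = lu_blocked(A0, 2)
```

Multiplying the unit-lower L and upper U extracted from `a` reproduces A0, which is a convenient correctness check for all the variants above.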
356/ 627
![Page 377: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/377.jpg)
BLAS 3 operations (n=500, nb=64) for LU variants

Variant         Routine        % Operations   % Time   Avg. MFlops
Left-looking    DGEMM          49             32       438
                DTRSM          41             45       268
                unblocked LU   10             20       146
Right-looking   DGEMM          82             56       414
                DTRSM          8              23       105
                unblocked LU   10             19       151
Crout           DGEMM          82             57       438
                DTRSM          8              24       105
                unblocked LU   10             16       189
357/ 627
![Page 378: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/378.jpg)
Other block algorithms

Most linear algebra algorithms can be recast in block variants:

- Linear systems: symmetric positive definite (LL^T), symmetric indefinite (LDL^T).
- Eigensolvers
- Linear least squares: QR decomposition based on Householder transformations.
358/ 627
![Page 379: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/379.jpg)
Example of performance with parallel BLAS

Speed-up on 8 processors vs. 1 processor with 1000x1000 matrices, CRAY YMP:

- LU factorization: 6.1
- Cholesky factorization: 6.2
- LDL^T factorization: 5.3
- QR factorization: 6.8
359/ 627
![Page 380: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/380.jpg)
Efficient dense linear algebra libraries
- Use of scientific libraries
- Level 1 BLAS and LINPACK
- BLAS
- LU Factorization
- LAPACK
- Linear algebra for distributed memory architectures
- BLACS (Basic Linear Algebra Communication Subprograms)
- PBLAS: parallel BLAS for distributed memory machines
- ScaLAPACK
- Recursive algorithms
360/ 627
![Page 381: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/381.jpg)
LAPACK: Linear Algebra PACKage

Scientific library developed in Fortran 77, making intensive use of BLAS 3 routines (600,000 lines of code).

- Supersedes LINPACK (Ax = b) and EISPACK (Ax = λx)
- Scope:
  - linear equations,
  - linear least squares,
  - standard eigenvalue and singular value problems,
  - generalized eigenvalue problems.
- Components:
  - driver routines: solve the complete problem (e.g., solve a linear system);
  - expert routines: similar to drivers but provide the user with more numerical information (e.g., estimate the condition number of a matrix, compute only a subset of eigenpairs, etc.);
  - computational routines: perform a distinct computational task (e.g., LU or QR factorization).
361/ 627
![Page 382: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/382.jpg)
The LAPACK library

- Good numerical robustness (relies on "clean" IEEE arithmetic)
- First public release: 1991. Available on netlib
- Latest release: 3.1.1, February 2007
- Main credits: Cray Research, Univ. Kentucky, Univ. of Tennessee, Courant Institute, NAG Ltd, Rice Univ., Argonne Nat. Lab., Oak Ridge Nat. Lab.
- Parallel implementation on shared memory/multicores inherited from parallel BLAS (efficiency limited on large numbers of processors/cores)
- Evolution for distributed memory multiprocessors: ScaLAPACK
362/ 627
![Page 384: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/384.jpg)
Case Study: Cholesky Factorization on multicores
(slides from Alfredo Buttari, Jack Dongarra, Jakub Kurzak andJulien Langou)
363/ 627
![Page 385: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/385.jpg)
Parallelism in LAPACK: Cholesky factorization

- DPOTF2: BLAS-2, non-blocked factorization of the panel
- DTRSM: BLAS-3, updates by applying the transformation computed in DPOTF2
- DGEMM (DSYRK): BLAS-3, updates of the trailing submatrix

(U = L^T)
![Page 386: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/386.jpg)
Parallelism in LAPACK: Cholesky factorization

BLAS-2 operations cannot be efficiently parallelized because they are bandwidth bound.

- strict synchronizations
- poor parallelism
- poor scalability
![Page 387: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/387.jpg)
Parallelism in LAPACK: Cholesky factorization

The execution flow is filled with stalls due to synchronizations and sequential operations.

[Figure: execution trace over time]
![Page 388: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/388.jpg)
Parallelism in LAPACK: Cholesky factorization

Tiling operations:

do DPOTF2 on the diagonal tile
  for all tiles below it do DTRSM end
  for all trailing tiles do DGEMM end
end
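The tiled loop can be sketched in pure Python (a minimal illustration under our own naming, with a flat row-major matrix and scalar inner loops standing in for the DPOTF2/DTRSM/DGEMM tile kernels):

```python
import math

def tiled_cholesky(A, n, b):
    """In-place lower Cholesky of a flat row-major n x n SPD matrix A,
    with tile size b dividing n. Mirrors the slide's loop: POTF2 on the
    diagonal tile, TRSM on the tiles below it, GEMM/SYRK on the trailing
    submatrix. The upper triangle of A is left untouched."""
    def idx(i, j):
        return i * n + j
    for k in range(0, n, b):
        # POTF2: unblocked Cholesky of the b x b diagonal tile
        for j in range(k, k + b):
            A[idx(j, j)] = math.sqrt(A[idx(j, j)] - sum(A[idx(j, t)] ** 2 for t in range(k, j)))
            for i in range(j + 1, k + b):
                A[idx(i, j)] = (A[idx(i, j)] - sum(A[idx(i, t)] * A[idx(j, t)] for t in range(k, j))) / A[idx(j, j)]
        # TRSM: each tile row below the diagonal solves X * L_kk^T = A_ik
        for i in range(k + b, n):
            for j in range(k, k + b):
                A[idx(i, j)] = (A[idx(i, j)] - sum(A[idx(i, t)] * A[idx(j, t)] for t in range(k, j))) / A[idx(j, j)]
        # GEMM/SYRK: trailing update A_ij -= A_ik * A_jk^T (lower part only)
        for i in range(k + b, n):
            for j in range(k + b, i + 1):
                for t in range(k, k + b):
                    A[idx(i, j)] -= A[idx(i, t)] * A[idx(j, t)]
```

On a DAG scheduler, each (k, i, j) instance of these three kernels becomes one task, with edges given by the reads and writes on the tiles.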
![Page 389: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/389.jpg)
Parallelism in LAPACK: Cholesky factorization
Cholesky can be represented as a Directed Acyclic Graph (DAG) where nodes are subtasks and edges are dependencies among them.
As long as dependencies are not violated, tasks can be scheduled in any order.
[Figure: DAG of the tiled Cholesky factorization; nodes are labeled i:j for the subtask operating on tile (i,j), from 1:1 up to 5:5]
![Page 390: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/390.jpg)
Parallelism in LAPACK: Cholesky factorization

[Figure: execution trace over time of the DAG-scheduled factorization]

higher flexibility, some degree of adaptivity, no idle time, better scalability

Cost: 1/3 n^3, n^3, 2n^3
![Page 391: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/391.jpg)
Parallelism in LAPACK: block data layout

[Figure: column-major vs. block data layout]
![Page 392: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/392.jpg)
Parallelism in LAPACK: block data layout

[Figure: column-major vs. block data layout]
![Page 393: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/393.jpg)
Parallelism in LAPACK: block data layout

[Figure: blocking speedup of DGEMM and DTRSM as a function of block size (64, 128, 256)]

The use of block data layout storage can significantly improve performance.
![Page 394: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/394.jpg)
Cholesky: performance

[Figure: Cholesky on a dual Clovertown — Gflop/s vs. problem size (up to 10000), asynchronous 2D blocking vs. LAPACK + threaded BLAS]
![Page 395: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/395.jpg)
Cholesky: performance

[Figure: Cholesky on an 8-way dual Opteron — Gflop/s vs. problem size (up to 15000), asynchronous 2D blocking vs. LAPACK + threaded BLAS]
![Page 396: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/396.jpg)
Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS: parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms
375/ 627
![Page 397: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/397.jpg)
Linear algebra for distributed memory architectures
I Difficulties:
  I Distribute data on the processors
  I Define enough parallel tasks, but not too many
  I Explicit message passing between processors
I Example: LU factorization
Suppose that
  I Each processor holds part of the matrix
  I Each processor performs the update operations on its part
  I Which data distribution should be used?
376/ 627
![Page 398: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/398.jpg)
Data distribution for dense matrices
[Figure: 1D block-cyclic distribution — column blocks assigned cyclically to processes 1, 2, 3, 4, 1, 2, 3, 4, ...]

1D block-cyclic

[Figure: 2D block-cyclic distribution — the 2x2 process grid (0 1 / 2 3) tiled cyclically over the blocks of the matrix]

2D block-cyclic
Main reasons for 2D block-cyclic:
1. good load balance, minimize communication,
2. use of BLAS 3 on each processor.
Need to communicate blocks of matrices between processors: BLACS
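The placement rule behind the 2D block-cyclic figure is pure modular arithmetic; a minimal sketch (0-based indices; `owner` is our own name, and the first block is assumed to live on process (0,0), which ScaLAPACK generalizes through descriptor entries):

```python
def owner(i, j, mb, nb, P, Q):
    """Grid coordinates (process row, process column) owning global
    entry (i, j) under a 2D block-cyclic distribution with mb x nb
    blocks on a P x Q process grid: block (i//mb, j//nb) is assigned
    cyclically along each grid dimension."""
    return (i // mb) % P, (j // nb) % Q
```

For mb = nb = 1 this degenerates to a purely cyclic distribution; very large mb, nb give a pure block distribution.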
377/ 627
![Page 399: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/399.jpg)
Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS: parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms
378/ 627
![Page 400: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/400.jpg)
BLACS (Basic Linear Algebra Communication Subprograms) (User's Guide, J. J. Dongarra, R. C. Whaley)

I Set of communication routines to implement linear algebra algorithms on distributed memory architectures
I Portable
I Available on top of MPI or PVM, on CMMD (Thinking Machines), MPL (IBM SPx), NX (Intel), . . .
I SPMD model
I Main concept: communication based on 2D arrays:
  I MxN rectangular matrices
  I trapezoidal matrices
379/ 627
![Page 401: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/401.jpg)
[Figure: trapezoidal matrix shapes for UPLO = 'U' and 'L', in the cases M <= N and M > N; the slanted edge is at offset n−m+1 (resp. m−n+1)]
380/ 627
![Page 402: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/402.jpg)
I Processes organized in a 2D grid P × Q such that P × Q = N, the number of processes
I Processes identified by their row/column indices
I Example: 8 processes in a 2x4 grid

         col 0   col 1   col 2   col 3
  row 0:   0       1       2       3
  row 1:   4       5       6       7

I Building a grid:
  I BLACS GRIDINIT
  I BLACS GRIDMAP
I Termination:
  I grid: BLACS GRIDEXIT
  I blacs: BLACS EXIT
381/ 627
![Page 403: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/403.jpg)
BLACS: communication routines
I send/receive
I broadcast
I Naming:
  I character 1: type of data (S, D, C, Z, I)
  I characters 2+3: data structure
    I GE: general matrix
    I TR: trapezoidal matrix (upper, lower, unit or not)
  I characters 5 and 6: function
    I SD: send
    I RV: receive
    I BS: broadcast (sender side)
    I BR: broadcast (receiver side)
382/ 627
![Page 404: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/404.jpg)
Examples
I send trapezoidal matrix:
  TRSD2D(ICONTXT, UPLO, DIAG, M, N, A, LDA, RDEST, CDEST)
I receive general (rectangular) matrix:
  GERV2D(ICONTXT, M, N, A, LDA, RSRC, CSRC)
I broadcast general matrix:
  GEBS2D(ICONTXT, SCOPE, TOP, M, N, A, LDA)
I broadcast reception:
  GEBR2D(ICONTXT, SCOPE, TOP, M, N, A, LDA, RSRC, CSRC)
383/ 627
![Page 405: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/405.jpg)
Parameters

I ICONTXT: context (identifies the grid of processes)
I SCOPE: one process, complete row, complete column or all processes
I TOP: network topology emulated
I M: number of rows of A
I N: number of columns of A
I A: matrix to send, A(LDA, *)
I RSRC: row of sender/receiver
I CSRC: column of sender/receiver
Global operators
I GAMX : maximum
I GAMN : minimum
I GSUM : summation
384/ 627
![Page 406: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/406.jpg)
Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS: parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms
385/ 627
![Page 407: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/407.jpg)
PBLAS: parallel BLAS (LAPACK Working Note 100, Choi, Dongarra, Ostrouchov, Petitet, Walker, and Whaley)

I Parallel operations based on BLAS 1 + 2 + 3, on top of sequential BLAS (similar interface)
I Based on BLACS
  Computations: BLAS / communications: BLACS
I 2D cyclic data distribution → good scalability/load balance
I Used to develop (part of) ScaLAPACK
I PBLAS: subset of BLAS (no operations on band or packed matrices), some extra operations (matrix transpose)
386/ 627
![Page 408: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/408.jpg)
I level 1 PBLAS:
  I swap: x ↔ y
  I dot product: x^T y
  I scaling: x ← αx
  I 2-norm: ‖x‖2
  I copy: y ← x
  I 1-norm: ‖re(x)‖1 + ‖im(x)‖1
  I axpy: y ← αx + y
  I largest element of a vector
I level 2 PBLAS:
  I matrix-vector multiplication
  I rank 1 updates
  I multiplication by a triangular matrix
  I triangular system solving
387/ 627
![Page 409: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/409.jpg)
I Level 3 PBLAS:
  I matrix-matrix multiplication
  I rank k and rank 2k updates
  I multiplication of a matrix by a triangular matrix
  I solution of triangular systems
  I matrix transpose (C ← βC + αA^T)
388/ 627
![Page 410: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/410.jpg)
Storage, initialization of distributed matrices

Let A be an M × N matrix with M = N = 5, partitioned in 2x2 blocks:

    a11 a12 a13 a14 a15
    a21 a22 a23 a24 a25
    a31 a32 a33 a34 a35
    a41 a42 a43 a44 a45
    a51 a52 a53 a54 a55
389/ 627
![Page 411: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/411.jpg)
Storage, initialization of distributed matrices

On a 2 by 2 grid of processors:

         0        |    1    |  0
------------------------------------------
a11 a12 | a13 a14 | a15
0 a21 a22 | a23 a24 | a25
------------------------------------------
a31 a32 | a33 a34 | a35
1 a41 a42 | a43 a44 | a45
------------------------------------------
0 a51 a52 | a53 a54 | a55
I Proc 0,0 :a11, a21, a51, a12, a22, a52, a15, a25, a55 (3x3)
I Proc 0,1 :a13, a23, a53, a14, a24, a54 (3x2)
I . . .
I Redistribution routines exist
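Each process can also compute where a global index lands in its local array; a small sketch under a 0-based convention (`global_to_local` is a hypothetical helper; the same formula applies to columns with NB and Q):

```python
def global_to_local(i, mb, P):
    """Map a global row index i to (process row, local row index) for a
    block-cyclic distribution with block size mb over P process rows:
    block number i//mb goes to process (i//mb) % P and becomes local
    block (i//mb) // P on that process."""
    block = i // mb
    return block % P, (block // P) * mb + i % mb
```

With the slide's 5 x 5 example (mb = 2, P = 2), global rows 0, 1 and 4 land on process row 0 as local rows 0, 1, 2, which is why process (0,0) stores a 3 x 3 local array.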
389/ 627
![Page 412: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/412.jpg)
Array descriptor: integer array of length 9

DESC() | Name    | Scope  | Definition
   1   | DTYPE_A | Global | Descriptor type (DTYPE_A = 1 for dense matrices)
   2   | CTXT_A  | Global | BLACS context indicating the BLACS process grid over which the global matrix is distributed
   3   | M_A     | Global | Number of rows in the global array A
   4   | N_A     | Global | Number of columns in the global array A
   5   | MB_A    | Global | Blocking factor used to distribute the rows of the array
   6   | NB_A    | Global | Blocking factor used to distribute the columns of the array
   7   | RSRC_A  | Global | Process row over which the first row of the array A is distributed
   8   | CSRC_A  | Global | Process column over which the first column of the array A is distributed
   9   | LLD_A   | Local  | Leading dimension of the local array
![Page 413: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/413.jpg)
Example of PBLAS call
I BLAS call
  CALL DGEMM (TRANSA, TRANSB, M, N, K, ALPHA,
 $             A(IA,JA), LDA, B(IB,JB), LDB,
 $             BETA, C(IC,JC), LDC)
I PBLAS call
  CALL PDGEMM (TRANSA, TRANSB, M, N, K, ALPHA,
 $              A, IA, JA, DESCA, B, IB, JB, DESCB,
 $              BETA, C, IC, JC, DESCC)
391/ 627
![Page 414: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/414.jpg)
![Page 415: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/415.jpg)
Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS: parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms
393/ 627
![Page 416: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/416.jpg)
ScaLAPACK Software Hierarchy
Goal: reuse most of the existing dense linear algebra software.
[Figure: ScaLAPACK software hierarchy — ScaLAPACK is built on PBLAS and LAPACK; PBLAS on BLACS; BLACS on a message passing library (MPI, PVM, ...); LAPACK on BLAS. PBLAS/BLACS operate on the global view, LAPACK/BLAS on local data]
394/ 627
![Page 417: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/417.jpg)
ScaLAPACK : Right-looking LU Factorization
Conversion of LAPACK codes
I Sequential LU:
1. Factor a column block ( IAMAX, SWAP, GER)
2. Pivot on the rest of the matrix ( SWAP)
3. Update the submatrix ( TRSM followed by GEMM)

I Parallel implementation with PBLAS:

1. Factor a column block (P AMAX, P SWAP, P GER)
2. Pivot on the rest of the matrix (P SWAP)
3. Update the submatrix (P TRSM followed by P GEMM)
395/ 627
![Page 418: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/418.jpg)
![Page 419: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/419.jpg)
![Page 420: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/420.jpg)
398/ 627
![Page 421: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/421.jpg)
ScaLAPACK: out-of-core algorithms

I Based on left-looking variants of LU, QR and Cholesky
I Same idea as with caches, but with much higher latency and smaller bandwidth
I QR easier than LU (no pivoting, more flops)
399/ 627
![Page 422: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/422.jpg)
Performance models for ScaLAPACK
I Communication: volume = Cv N^2, number of messages = Cm N / NB
I flops: Cf N^3

T(N,P) = (Cf N^3 / P) tf + (Cv N^2 / √P) tv + (Cm N / NB) tm

(tf: avg. time for a flop, tm: latency, tv^{-1}: bandwidth)
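The model is easy to evaluate numerically; a small helper (our own naming) following the formula term by term:

```python
from math import sqrt

def scalapack_time(N, P, NB, tf, tv, tm, Cf, Cv, Cm):
    """T(N,P) = Cf*N^3/P * tf + Cv*N^2/sqrt(P) * tv + Cm*N/NB * tm:
    flops spread over P processes, communicated volume over a sqrt(P)
    grid dimension, and one latency term per message."""
    return Cf * N ** 3 / P * tf + Cv * N ** 2 / sqrt(P) * tv + Cm * N / NB * tm
```

Increasing NB trades fewer messages (a smaller latency term) against load balance, which the model makes explicit.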
400/ 627
![Page 423: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/423.jpg)
Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS: parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms
401/ 627
![Page 424: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/424.jpg)
Recursive algorithms
Example 1: LU factorization
1. Split matrix A into two rectangles of size m x n/2.
   If there is only 1 column, divide the column by the pivot and return
2. Apply the LU algorithm to the left part: A11 = LU, with A21 updated
3. Apply the transformations to the right part
   (triangular solve A12 = L^{-1} A12 and matrix multiplication A22 = A22 − A21 A12)
4. Apply the LU algorithm to the right (square) part

→ Matrices with n/2, n/4, n/8, . . . columns
Example 2: Matrix-matrix multiplication:

AB = ( A11  A12 ) ( B11  B12 ) = ( A11 B11 + A12 B21   A11 B12 + A12 B22 )
     ( A21  A22 ) ( B21  B22 )   ( A21 B11 + A22 B21   A21 B12 + A22 B22 )

with recursive blocking for each matrix-matrix multiplication.
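A sketch of the recursive scheme in pure Python (an illustration under our own naming, for power-of-two orders only; the `cutoff` base case stands in for a call to a tuned BLAS kernel):

```python
def rec_matmul(A, B, cutoff=16):
    """Recursive blocked product C = A*B for square matrices (lists of
    rows) whose order is a power of two: split each operand into 2x2
    blocks and recurse; below `cutoff`, fall back to triple loops."""
    n = len(A)
    if n <= cutoff:
        return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    h = n // 2
    def quad(M):
        # split M into its four h x h blocks M11, M12, M21, M22
        return ([row[:h] for row in M[:h]], [row[h:] for row in M[:h]],
                [row[:h] for row in M[h:]], [row[h:] for row in M[h:]])
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    C11 = add(rec_matmul(A11, B11, cutoff), rec_matmul(A12, B21, cutoff))
    C12 = add(rec_matmul(A11, B12, cutoff), rec_matmul(A12, B22, cutoff))
    C21 = add(rec_matmul(A21, B11, cutoff), rec_matmul(A22, B21, cutoff))
    C22 = add(rec_matmul(A21, B12, cutoff), rec_matmul(A22, B22, cutoff))
    # reassemble C from its four blocks
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + [r1 + r2 for r1, r2 in zip(C21, C22)]
```

The recursion automatically produces blocks of every size n/2, n/4, . . . , which is what gives the cache adaptivity discussed next.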
402/ 627
![Page 425: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/425.jpg)
Recursive algorithms
I automatic adaptation to cache size (at all levels)
I easier to tune and often very efficient compared to classical approaches
I can exploit recursive data layouts
  I 'Z' or 'U' storage for unsymmetric matrices
  I recursive packed storage possible for symmetric matrices (Cholesky factorization, . . . )
403/ 627
![Page 426: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/426.jpg)
Conclusion
I Standards have been defined for most useful linear algebra operations
I One should not try to write one's own routines
I Efficient sequential and parallel implementations are available:
  I Shared memory architectures
  I Distributed memory architectures
I But . . . a higher degree of parallelism is now needed (multicores, thousands of processors)
Software always one step behind architectures
404/ 627
![Page 427: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/427.jpg)
Parallelism in Linear Algebra software so far

[Figure: shared memory parallelism — LAPACK on top of threaded BLAS, on top of PThreads/OpenMP; distributed memory parallelism — ScaLAPACK on top of PBLAS, on top of BLACS + MPI]
![Page 429: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/429.jpg)
Outline
Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
407/ 627
![Page 430: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/430.jpg)
A selection of references
I Books
I Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.
I Dongarra, Duff, Sorensen and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.
I George, Liu, and Ng, Computer Solution of Sparse Positive Definite Systems, book to appear.

I Articles

I Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 1993.
I Liu, The role of elimination trees in sparse factorization, SIMAX, 1990.
I Heath, E. Ng and B. W. Peyton, Parallel Algorithms for Sparse Linear Systems, SIAM Review, 1991.
408/ 627
![Page 431: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/431.jpg)
Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
409/ 627
![Page 432: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/432.jpg)
Motivations
I solution of linear systems of equations → key algorithmic kernel
Continuous problem↓
Discretization↓
Solution of a linear system Ax = b
I Main parameters:
  I Numerical properties of the linear system (symmetry, positive definiteness, conditioning, . . . )
  I Size and structure:
    I Large (> 100000 × 100000 ?), square/rectangular
    I Dense or sparse (structured / unstructured)
  I Target computer (sequential/parallel)
→ Algorithmic choices are critical
410/ 627
![Page 433: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/433.jpg)
Motivations for designing efficient algorithms
I Time-critical applications
I Solve larger problems
I Decrease elapsed time (parallelism ?)
I Minimize cost of computations (time, memory)
411/ 627
![Page 434: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/434.jpg)
Difficulties
I Access to data:
  I Computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  I Sparse matrix: large irregular dynamic data structures
  → Exploit the locality of references to data on the computer (design algorithms providing such locality)
I Efficiency (time and memory):
  I Number of operations and memory depend very much on the algorithm used and on the numerical and structural properties of the problem
  I The algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), GRID)
→ Algorithmic choices are critical
412/ 627
![Page 435: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/435.jpg)
Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
413/ 627
![Page 436: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/436.jpg)
Sparse matrices
Example:
3 x1 + 2 x2        = 5
       2 x2 − 5 x3 = 1
2 x1        + 3 x3 = 0

can be represented as Ax = b, where

A = ( 3  2   0 )        ( x1 )        ( 5 )
    ( 0  2  −5 ) ,  x = ( x2 ) ,  b = ( 1 )
    ( 2  0   3 )        ( x3 )        ( 0 )
Sparse matrix: only nonzeros are stored.
414/ 627
![Page 437: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/437.jpg)
Sparse matrix ?
[Figure: nonzero pattern of the original matrix (592 × 592), nz = 5104]

Matrix dwt 592.rua (N=592, NZ=5104); structural analysis of a submarine
415/ 627
![Page 438: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/438.jpg)
Factorization process
Solution of Ax = b
I A is unsymmetric:
  I A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix
  I Forward-backward substitution: Ly = b, then Ux = y
I A is symmetric:
  I A = LDL^T or LL^T
I A is rectangular, m × n with m ≥ n, and we solve min_x ‖Ax − b‖2:
  I A = QR, where Q is orthogonal (Q^{-1} = Q^T) and R is triangular
  I Solve: y = Q^T b, then Rx = y
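The forward-backward substitution of the LU case can be sketched directly (dense lists of rows; a minimal illustration, not a library routine):

```python
def lu_solve(L, U, b):
    """Solve A x = b given A = L U: forward substitution L y = b,
    then backward substitution U x = y. L is lower triangular with
    nonzero diagonal, U is upper triangular."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                    # forward: L y = b
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):          # backward: U x = y
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x
```

For sparse factors the loops run over the stored nonzeros only, which is where the data structures discussed below come in.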
416/ 627
![Page 441: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/441.jpg)
Difficulties
I Only non-zero values are stored
I Factors L and U have far more nonzeros than A
I Data structures are complex
I Computations are only a small portion of the code (the rest is data manipulation)
I Memory size is a limiting factor → out-of-core solvers
417/ 627
![Page 442: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/442.jpg)
Key numbers:
1- Average size: 100 MB matrix; factors = 2 GB; flops = 10 Gflops
2- A bit more "challenging": Lab. Geosciences Azur, Valbonne
  I Complex matrix arising in 2D: 16 × 10^6 unknowns, 150 × 10^6 nonzeros
  I Storage: 5 GB (12 GB with the factors ?)
  I Flops: tens of TeraFlops
3- Typical performance (MUMPS):
  I PC LINUX (P4, 2GHz): 1.0 GFlops/s
  I Cray T3E (512 procs): speed-up ≈ 170, perf. 71 GFlops/s
418/ 627
![Page 443: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/443.jpg)
Typical test problems:
BMW car body, 227,362 unknowns, 5,757,996 nonzeros, MSC.Software

Size of factors: 51.1 million entries
Number of operations: 44.9 × 10^9
419/ 627
![Page 444: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/444.jpg)
Typical test problems:
BMW crankshaft, 148,770 unknowns, 5,396,386 nonzeros, MSC.Software

Size of factors: 97.2 million entries
Number of operations: 127.9 × 10^9
420/ 627
![Page 445: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/445.jpg)
Sources of parallelism
Several levels of parallelism can be exploited:
I At problem level: the problem can be decomposed into sub-problems (e.g. domain decomposition)
I At matrix level arising from its sparse structure
I At submatrix level within dense linear algebra computations(parallel BLAS, . . . )
421/ 627
![Page 446: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/446.jpg)
Data structure for sparse matrices
I Storage scheme depends on the pattern of the matrix and on the type of access required:
  I band or variable-band matrices
  I "block bordered" or block tridiagonal matrices
  I general matrix
  I row, column or diagonal access
422/ 627
![Page 447: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/447.jpg)
Data formats for a general sparse matrix A
What needs to be represented
I Assembled matrices: MxN matrix A with NNZ nonzeros.
I Elemental matrices (unassembled): MxN matrix A with NELT elements.
I Arithmetic: Real (4 or 8 bytes) or complex (8 or 16 bytes)
I Symmetric (or Hermitian)→ store only part of the data.
I Distributed format ?
I Duplicate entries and/or out-of-range values ?
423/ 627
![Page 448: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/448.jpg)
Classical Data Formats for Assembled Matrices

I Example of a 3x3 matrix with NNZ=5 nonzeros:

      ( a11   .    .  )
      (  .   a22  a23 )
      ( a31   .   a33 )

I Coordinate format:
  IRN[1:NNZ] = 1 3 2 2 3
  JCN[1:NNZ] = 1 1 2 3 3
  VAL[1:NNZ] = a11 a31 a22 a23 a33
I Compressed Sparse Column (CSC) format:
  IRN[1:NNZ] = 1 3 2 2 3
  VAL[1:NNZ] = a11 a31 a22 a23 a33
  COLPTR[1:N+1] = 1 3 4 6
  Column J is stored in IRN/VAL locations COLPTR(J)...COLPTR(J+1)-1
I Compressed Sparse Row (CSR) format: similar to CSC, but row by row
I Diagonal format:
  NDIAG = 3
  IDIAG = −2 0 1

  VAL = ( na   na   a31 )
        ( a11  a22  a33 )
        ( na   a23  na  )

  (na: not accessed)
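Converting the coordinate arrays into CSC is a sort plus a prefix sum; a small sketch with our own helper name, keeping the slide's 1-based indices:

```python
def coo_to_csc(n, irn, jcn, val):
    """Convert 1-based coordinate arrays (IRN, JCN, VAL) into CSC:
    entries sorted column by column, plus COLPTR of length n+1 such
    that column J occupies positions COLPTR(J)..COLPTR(J+1)-1."""
    order = sorted(range(len(val)), key=lambda k: (jcn[k], irn[k]))
    csc_irn = [irn[k] for k in order]
    csc_val = [val[k] for k in order]
    counts = [0] * (n + 1)
    for j in jcn:
        counts[j] += 1                 # entries per column (1-based)
    colptr = [1] * (n + 1)
    for j in range(1, n + 1):
        colptr[j] = colptr[j - 1] + counts[j]
    return csc_irn, csc_val, colptr
```

On the 3x3 example above this reproduces COLPTR = 1 3 4 6.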
424/ 627
![Page 452: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/452.jpg)
Sparse Matrix-vector products
Assume we want to compute Y ← AX.
Various algorithms for the matrix-vector product exist, depending on the sparse matrix format:

I Coordinate format:

      Y(1:N) = 0
      DO i = 1, NNZ
         Y(IRN(i)) = Y(IRN(i)) + VAL(i) * X(JCN(i))
      ENDDO

I CSC format:

      Y(1:N) = 0
      DO J = 1, N
         DO I = COLPTR(J), COLPTR(J+1)-1
            Y(IRN(I)) = Y(IRN(I)) + VAL(I) * X(J)
         ENDDO
      ENDDO

I Diagonal format:

      Y(1:N) = 0
      DO K = 1, NDIAG
         DO I = max(1, 1-IDIAG(K)), min(N, N-IDIAG(K))
            Y(I) = Y(I) + VAL(I,K) * X(I+IDIAG(K))
         ENDDO
      ENDDO
425/ 627
![Page 455: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/455.jpg)
Example of elemental matrix format

       1 | -1  2  3 |         3 |  2 -1  3 |
  A1 = 2 |  2  1  1 | ,  A2 = 4 |  1  2 -1 |
       3 |  1  1  1 |         5 |  3  2  1 |

I N=5, NELT=2, NVAR=6, A = sum of Ai for i = 1..NELT

  ELTPTR [1:NELT+1] = 1 4 7
  ELTVAR [1:NVAR]   = 1 2 3 3 4 5
  ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1

I Remarks:
  I NVAR = ELTPTR(NELT+1)-1
  I NVAL = sum of Si^2 (unsymmetric) or sum of Si(Si+1)/2 (symmetric), with Si = ELTPTR(i+1) - ELTPTR(i)
  I Storage of elements in ELTVAL: by columns
426/ 627
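The ELTPTR/ELTVAR/ELTVAL arrays above can be expanded into the assembled matrix with a short sketch (not from the course; 0-based Python indices, column-major element storage as stated in the remarks):

```python
# Sketch (0-based indices): expand the elemental storage of the slide into
# the assembled 5x5 matrix A = A1 + A2.
NELT, N = 2, 5
ELTPTR = [0, 3, 6]                   # element e owns ELTVAR[ELTPTR[e]:ELTPTR[e+1]]
ELTVAR = [0, 1, 2, 2, 3, 4]          # variables of A1 (1,2,3) and A2 (3,4,5), 0-based
ELTVAL = [-1, 2, 1, 2, 1, 1, 3, 1, 1,        # A1, stored column by column
          2, 1, 3, -1, 2, 2, 3, -1, 1]       # A2, stored column by column

def assemble():
    A = [[0.0] * N for _ in range(N)]
    pos = 0
    for e in range(NELT):
        var = ELTVAR[ELTPTR[e]:ELTPTR[e + 1]]
        s = len(var)
        for j in range(s):           # columns vary slowest: column-major storage
            for i in range(s):
                A[var[i]][var[j]] += ELTVAL[pos]
                pos += 1
    return A

A = assemble()
print(A[2][2])   # 3.0: the overlap entry receives A1(3,3) + A2(1,1) = 1 + 2
```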
![Page 456: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/456.jpg)
File storage: Rutherford-Boeing
I Standard ASCII format for files
I Header + Data (CSC format). Key xyz:
  I x=[rcp] (real, complex, pattern)
  I y=[suhzr] (sym., uns., herm., skew sym., rectang.)
  I z=[ae] (assembled, elemental)
  I ex: M_T1.RSA, SHIP003.RSE
I Supplementary files: right-hand-sides, solution,permutations. . .
I Canonical format introduced to guarantee a uniquerepresentation (order of entries in each column, no duplicates).
427/ 627
![Page 457: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/457.jpg)
File storage: Rutherford-Boeing
DNV-Ex 1 : Tubular joint-1999-01-17 M_T1
1733710 9758 492558 1231394 0
rsa 97578 97578 4925574 0
(10I8) (10I8) (3e26.16)
1 49 96 142 187 231 274 346 417 487
556 624 691 763 834 904 973 1041 1108 1180
1251 1321 1390 1458 1525 1573 1620 1666 1711 1755
1798 1870 1941 2011 2080 2148 2215 2287 2358 2428
2497 2565 2632 2704 2775 2845 2914 2982 3049 3115
...
1 2 3 4 5 6 7 8 9 10
11 12 49 50 51 52 53 54 55 56
57 58 59 60 67 68 69 70 71 72
223 224 225 226 227 228 229 230 231 232
233 234 433 434 435 436 437 438 2 3
4 5 6 7 8 9 10 11 12 49
50 51 52 53 54 55 56 57 58 59
...
-0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11
0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08
0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10
0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07
0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09
0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07
0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08
0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08
428/ 627
![Page 458: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/458.jpg)
Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination
429/ 627
![Page 459: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/459.jpg)
Gaussian elimination
A = A^(1), b = b^(1), A^(1) x = b^(1):

  | a11 a12 a13 | | x1 |   | b1 |
  | a21 a22 a23 | | x2 | = | b2 |     row 2 <- row 2 - row 1 x a21/a11
  | a31 a32 a33 | | x3 |   | b3 |     row 3 <- row 3 - row 1 x a31/a11

A^(2) x = b^(2):

  | a11 a12     a13     | | x1 |   | b1     |     b2^(2)  = b2  - a21 b1/a11  ...
  | 0   a22^(2) a23^(2) | | x2 | = | b2^(2) |     a32^(2) = a32 - a31 a12/a11 ...
  | 0   a32^(2) a33^(2) | | x3 |   | b3^(2) |

Finally A^(3) x = b^(3):

  | a11 a12     a13     | | x1 |   | b1     |
  | 0   a22^(2) a23^(2) | | x2 | = | b2^(2) |     a33^(3) = a33^(2) - a32^(2) a23^(2) / a22^(2) ...
  | 0   0       a33^(3) | | x3 |   | b3^(3) |

Typical Gaussian elimination step k:   a_ij^(k+1) = a_ij^(k) - a_ik^(k) a_kj^(k) / a_kk^(k)
430/ 627
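The generic step above can be sketched in Python (a dense, no-pivoting illustration, not part of the course material):

```python
# Dense, no-pivoting sketch of the elimination step
# a_ij^(k+1) = a_ij^(k) - a_ik^(k) * a_kj^(k) / a_kk^(k), applied in place.
def gaussian_elimination(A):
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            lik = A[i][k] / A[k][k]          # multiplier l_ik
            for j in range(k + 1, n):
                A[i][j] -= lik * A[k][j]     # update the trailing submatrix
            A[i][k] = lik                    # keep the multipliers below the diagonal
    return A                                  # U on/above the diagonal

A = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
print(gaussian_elimination(A))   # [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0], [4.0, 3.0, 2.0]]
```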
![Page 460: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/460.jpg)
Relation with A = LU factorization

I One step of Gaussian elimination can be written A^(k+1) = L^(k) A^(k), with

            | 1                       |
            |   .                     |
  L^(k) =   |     1                   |    and  l_ik = a_ik^(k) / a_kk^(k).
            |    -l_(k+1,k)  1        |
            |       :           .     |
            |    -l_(n,k)          1  |

I Then A^(n) = U = L^(n-1) ... L^(1) A, which gives A = LU,

                                          | 1      0 |
  with L = [L^(1)]^-1 ... [L^(n-1)]^-1 =  |   .      | ,
                                          | l_ij   1 |

I In dense codes, entries of L and U overwrite entries of A.
I Furthermore, if A is symmetric, A = LDL^T with d_kk = a_kk^(k):
  A = LU = A^T = U^T L^T implies U (L^T)^-1 = L^-1 U^T = D diagonal,
  and U = D L^T, thus A = L (D L^T) = LDL^T.
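The LDL^T relation can be checked numerically (a sketch, not from the course, assuming no pivoting is needed):

```python
# Sketch: factor a symmetric matrix as A = L D L^T with d_kk = a_kk^(k),
# then rebuild A from the factors to check the identity.
def ldlt(A):
    n = len(A)
    A = [row[:] for row in A]                 # work on a copy
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for k in range(n):
        D[k] = A[k][k]                        # d_kk = a_kk^(k)
        for i in range(k + 1, n):
            L[i][k] = A[i][k] / A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= L[i][k] * A[k][j]
    return L, D

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, D = ldlt(A)
recon = [[sum(L[i][k] * D[k] * L[j][k] for k in range(3))
          for j in range(3)] for i in range(3)]
print(recon == A)   # True: L * diag(D) * L^T reproduces A
```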
![Page 461: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/461.jpg)
Gaussian elimination and sparsity
Step k of LU factorization (akk pivot):
I For i > k compute lik = aik/akk (= a'ik),
I For i > k, j > k:

  a'ij = aij - (aik x akj)/akk    or    a'ij = aij - lik x akj

I If aik ≠ 0 and akj ≠ 0 then a'ij ≠ 0
I If aij was zero, its new non-zero value must be stored: this is fill-in.

[Figure: at step k, nonzeros in positions (i,k) and (k,j) turn a zero in position (i,j) into a nonzero (fill-in)]
432/ 627
![Page 462: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/462.jpg)
I Idem for Cholesky:
  I For i > k compute lik = aik/sqrt(akk) (= a'ik),
  I For i > k, j > k, j <= i (lower triang.),

    a'ij = aij - (aik x ajk)/akk    or    a'ij = aij - lik x a'jk
433/ 627
![Page 463: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/463.jpg)
Example
I Original matrix
  x x x x x
  x x
  x   x
  x     x
  x       x
I Matrix is full after the first step of elimination
I After reordering the matrix (1st row and column ↔ last row and column):
434/ 627
![Page 464: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/464.jpg)
  x       x
    x     x
      x   x
        x x
  x x x x x

I No fill-in
I Ordering the variables has a strong impact on
  I the fill-in
  I the number of operations
435/ 627
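The effect of this reordering can be checked with a small symbolic elimination sketch (not from the course; it only propagates the nonzero pattern of the arrow matrix):

```python
# Symbolic elimination sketch: propagate only the nonzero pattern and count
# the fill-in created when eliminating variables 0..n-1 in order.
def fill_in(pattern, n):
    S = set(pattern)
    fill = 0
    for k in range(n):
        rows = [i for i in range(k + 1, n) if (i, k) in S]
        cols = [j for j in range(k + 1, n) if (k, j) in S]
        for i in rows:
            for j in cols:
                if (i, j) not in S:       # a zero becomes a nonzero
                    S.add((i, j))
                    fill += 1
    return fill

n = 5
diag = {(i, i) for i in range(n)}
# full first row/column: eliminating variable 0 fills the whole matrix
arrow_first = diag | {(0, j) for j in range(n)} | {(i, 0) for i in range(n)}
# full last row/column: no fill-in at all
arrow_last = diag | {(n - 1, j) for j in range(n)} | {(i, n - 1) for i in range(n)}
print(fill_in(arrow_first, n), fill_in(arrow_last, n))   # 12 0
```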
![Page 465: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/465.jpg)
Table: Benefits of sparsity on a matrix of order 2021 with 7353 nonzeros (Dongarra et al. 91).

Procedure                     Total storage   Flops         Time (sec.) on CRAY J90
Full Syst.                    4084 Kwords     5503 x 10^6   34.5
Sparse Syst.                  71 Kwords       1073 x 10^6    3.4
Sparse Syst. and reordering   14 Kwords         42 x 10^3    0.9
436/ 627
![Page 466: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/466.jpg)
Efficient implementation of sparse solvers
I Indirect addressing is often used in sparse calculations: e.g. sparse SAXPY

  do i = 1, m
     A( ind(i) ) = A( ind(i) ) + alpha * w( i )
  enddo

I Even if manufacturers provide hardware support for indirect addressing, it penalizes performance
I Hence: switch to dense calculations as soon as the matrix is not sparse enough
437/ 627
![Page 467: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/467.jpg)
Effect of switch to dense calculations
Matrix from a 5-point discretization of the Laplacian on a 50 x 50 grid (Dongarra et al. 91)

Density for switch   Order of      Millions   Time
to full code         full matrix   of flops   (sec.)
No switch                  0           7      21.8
1.00                      74           7      21.4
0.80                     190           8      15.0
0.60                     235          11      12.5
0.40                     305          21       9.0
0.20                     422          50       5.5
0.10                     531         100       3.7
0.005                   1420        1908       6.1

The sparse structure should be exploited if density < 10%.
438/ 627
![Page 468: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/468.jpg)
Outline
Ordering sparse matrices
  Objectives/Outline
  Fill-reducing orderings
439/ 627
![Page 469: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/469.jpg)
Ordering sparse matrices
  Objectives/Outline
  Fill-reducing orderings
440/ 627
![Page 470: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/470.jpg)
Ordering sparse matrices: objectives/outline
I Reduce fill-in and number of operations during factorization (local and global heuristics)
I Increase parallelism (wide tree)
I Decrease memory usage (deep tree)
I Equivalent orderings: traverse the tree to minimize working memory
I Reorder unsymmetric matrices to special forms:
  I block upper triangular matrix
  I with (large) non-zero entries on the diagonal (maximum transversal)
I Combining approaches
441/ 627
![Page 471: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/471.jpg)
Ordering sparse matrices
  Objectives/Outline
  Fill-reducing orderings
442/ 627
![Page 472: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/472.jpg)
Fill-reducing orderings
Three main classes of methods for minimizing fill-in during factorization:

I Global approach: the matrix is permuted into a matrix with a given pattern
  I Fill-in is restricted to occur within that structure
  I Cuthill-McKee (block tridiagonal matrix)
  I Nested dissections ("block bordered" matrix)
443/ 627
![Page 473: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/473.jpg)
Fill-reducing orderings
I Local heuristics: at each step of the factorization, selection of the pivot that is likely to minimize fill-in.
  I The method is characterized by the way pivots are selected.
  I Markowitz criterion (for a general matrix).
  I Minimum degree (for symmetric matrices).
I Hybrid approaches: once the matrix is permuted in order to obtain a block structure, local heuristics are used within the blocks.
443/ 627
![Page 474: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/474.jpg)
Cuthill-McKee and Reverse Cuthill-McKee
Consider the matrix:

      | x x   x x   |
      | x x         |
  A = |     x x x   |
      | x   x x   x |
      | x   x   x   |
      |       x   x |

The corresponding graph has vertices 1 to 6 and edges 1-2, 1-4, 1-5, 3-4, 3-5, 4-6.
444/ 627
![Page 475: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/475.jpg)
Cuthill-McKee algorithm
I Goal: reduce the profile/bandwidth of the matrix (the fill is restricted to the band structure)
I Level sets (as in Breadth First Search) are built from the vertex of minimum degree (priority to the vertex of smallest number).
  We get: S1 = {2}, S2 = {1}, S3 = {4, 5}, S4 = {3, 6}, and thus the ordering 2, 1, 4, 5, 3, 6.

The reordered matrix is:

      | x x         |
      | x x x x     |
  A = |   x x   x x |
      |   x   x x   |
      |     x x x   |
      |     x     x |
445/ 627
![Page 476: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/476.jpg)
Reverse Cuthill-McKee
I The ordering is the reverse of that obtained using Cuthill-McKee, i.e. on the example: 6, 3, 5, 4, 1, 2
I The reordered matrix is:

      | x     x     |
      |   x x x     |
  A = |   x x   x   |
      | x x   x x   |
      |     x x x x |
      |         x x |

I More efficient than Cuthill-McKee at reducing the envelope of the matrix.
446/ 627
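The level-set construction can be sketched as a breadth-first search (assumptions: the edge set below is the one inferred from the example's matrix pattern, and ties inside a level set are broken by smallest vertex number only, as the slide suggests):

```python
# Sketch: Cuthill-McKee as a breadth-first search on the example graph.
from collections import deque

adj = {1: [2, 4, 5], 2: [1], 3: [4, 5], 4: [1, 3, 6], 5: [1, 3], 6: [4]}

def cuthill_mckee(adj, start):
    order, seen, q = [], {start}, deque([start])
    while q:
        v = q.popleft()
        order.append(v)
        for w in sorted(adj[v]):         # smallest vertex number first
            if w not in seen:
                seen.add(w)
                q.append(w)
    return order

cm = cuthill_mckee(adj, 2)               # start from a minimum-degree vertex
rcm = list(reversed(cm))                 # Reverse Cuthill-McKee
print(cm, rcm)   # [2, 1, 4, 5, 3, 6] [6, 3, 5, 4, 1, 2]
```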
![Page 477: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/477.jpg)
Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua

Harwell-Boeing matrix: dwt 592.rua, structural computing on a submarine. NZ(LU factors)=58202
[Figure: spy plots — original matrix (nz = 5104) and factorized matrix (nz = 58202)]
447/ 627
![Page 478: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/478.jpg)
Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua

NZ(LU factors)=16924
[Figure: spy plots — permuted matrix (RCM, nz = 5104) and factorized permuted matrix (nz = 16924)]
447/ 627
![Page 479: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/479.jpg)
Nested Dissection
Recursive approach based on graph partitioning.
[Figure: graph partitioned by nested separators S1, S2, S3 into four subdomains, and the corresponding permuted matrix with the separator variables ordered last]
448/ 627
![Page 480: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/480.jpg)
Local heuristics to reduce fill-in during factorization
Let G(A) be the graph associated with a matrix A that we want to order using local heuristics.
Let Metric be such that Metric(vi) < Metric(vj) implies vi is a better pivot candidate than vj.

Generic algorithm
Loop until all nodes are selected:
  Step 1: select current node p (the so-called pivot) with minimum metric value,
  Step 2: update the elimination graph,
  Step 3: update Metric(vj) for all non-selected nodes vj.

Step 3 should only be applied to nodes for which the Metric value might have changed.
449/ 627
![Page 481: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/481.jpg)
Reordering unsymmetric matrices: Markowitz criterion

I At step k of Gaussian elimination:

  [Figure: active submatrix Ak, bordered by the already-computed parts of L and U]

I r_i^k = number of non-zeros in row i of Ak
I c_j^k = number of non-zeros in column j of Ak
I A candidate pivot aij must be large enough and should minimize (r_i^k - 1) x (c_j^k - 1) for all i, j >= k
I Minimum degree: the Markowitz criterion for symmetric diagonally dominant matrices
450/ 627
![Page 482: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/482.jpg)
Minimum degree algorithm
I Step 1: select the vertex that possesses the smallest number of neighbors in G^0.

[Figure: (a) sparse symmetric matrix of order 10, (b) its elimination graph]

The node/variable selected is 1, of degree 2.
451/ 627
![Page 483: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/483.jpg)
I Notation for the elimination graph
  I Let G^k = (V^k, E^k) be the graph built at step k.
  I G^k describes the structure of A^k after eliminating k pivots.
  I G^k is non-oriented (A^k is symmetric).
  I Fill-in in A^k ≡ adding edges in the graph.
452/ 627
![Page 484: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/484.jpg)
Illustration
Step 1: elimination of pivot 1

[Figure: (a) the elimination graph after suppressing pivot 1, its neighbors now forming a clique; (b) factors and active submatrix. Legend: initial nonzeros, fill-in, nonzeros in factors]
![Page 485: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/485.jpg)
Minimum degree algorithm applied to the graph:
I Step k: select the node with the smallest number of neighbors.
I G^k is built from G^(k-1) by suppressing the pivot and adding edges corresponding to fill-in.
454/ 627
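The rule above can be sketched directly on the elimination graph (not from the course; the example graph below is the 6-vertex graph used in the Cuthill-McKee section, a hypothetical choice):

```python
# Sketch: minimum degree on the elimination graph. Eliminating the pivot
# turns its neighborhood into a clique (the fill edges of step k).
def minimum_degree(graph):
    adj = {v: set(nb) for v, nb in graph.items()}
    order = []
    while adj:
        p = min(adj, key=lambda v: (len(adj[v]), v))   # smallest degree, then number
        nbrs = adj.pop(p)
        order.append(p)
        for v in nbrs:
            adj[v].discard(p)
            adj[v] |= nbrs - {v}                       # add the fill edges
    return order

# hypothetical example: the 6-vertex graph of the Cuthill-McKee section
graph = {1: {2, 4, 5}, 2: {1}, 3: {4, 5}, 4: {1, 3, 6}, 5: {1, 3}, 6: {4}}
print(minimum_degree(graph))   # [2, 6, 1, 3, 4, 5]
```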
![Page 486: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/486.jpg)
Illustration (cont’d)
Graphs G1, G2, G3 and the corresponding reduced matrices.

[Figure: (a) elimination graphs G1, G2, G3; (b) factors and active submatrices. Legend: original nonzero, original nonzero modified, fill-in, nonzeros in factors]
455/ 627
![Page 487: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/487.jpg)
Minimum Degree does not always minimize fill-in!

Consider the following matrix and its corresponding elimination graph.

[Figure: 9x9 matrix (nodes 1-9) and its elimination graph]

I Step 1 of Minimum Degree: select pivot 5 (minimum degree = 2)
I Updated graph: edge (4,6) is added, i.e. fill-in
I Remark: using the initial ordering, no fill-in occurs
456/ 627
![Page 488: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/488.jpg)
Efficient implementation of Minimum Degree

Reduce time complexity
1. Accelerate selection of pivots and update of the graph:
I 1.1 Supervariables (or indistinguishable nodes): if several variables have the same adjacency structure in G^k, they can be eliminated simultaneously.
I 1.2 Two non-adjacent nodes of the same degree can be eliminated simultaneously (multiple eliminations).
I 1.3 The degree update of the neighbours of the pivot can be performed approximately (Approximate Minimum Degree).
457/ 627
![Page 489: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/489.jpg)
Reduce memory complexity
2. Decrease the size of the working space.
Using the elimination graph, the working space is of order O(#nonzeros in factors).

I Fill-in: let pivot be the pivot at step k.
  If i ∈ Adj_{G^(k-1)}(pivot), then Adj_{G^(k-1)}(pivot) ⊂ Adj_{G^k}(i):
  the structure of the pivot column is included in the filled structure of column i.
I We can then use an implicit representation of fill-in by defining the notion of element (variable already eliminated) and the quotient graph. A variable of the quotient graph is adjacent to variables and elements.
I One can show that for all k ∈ [1..N], the size of the quotient graph is O(size of G^0).
458/ 627
![Page 490: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/490.jpg)
Influence on the structure of factors
Harwell-Boeing matrix: dwt 592.rua, structural computing on a submarine. NZ(LU factors)=58202
[Figure: spy plot of the original matrix, nz = 5104]
459/ 627
![Page 491: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/491.jpg)
Structure of factors after permutation

Minimum Degree vs MMD (1.1+1.2+2)

[Figure: spy plots of the factors; Minimum Degree: nz = 15110, MMD: nz = 14838]

Detection of supervariables allows one to build more regularly structured factors (easier factorization).
460/ 627
![Page 492: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/492.jpg)
Comparison of 3 implementations of Minimum Degree
I Let V0 be the initial algorithm (based on the elimination graph).
I MMD: the version including 1.1 + 1.2 + 2 (Multiple Minimum Degree, Liu 1985, 1989), used in MATLAB.
I AMD: the version including 1.1 + 1.3 + 2 (Approximate Minimum Degree, Amestoy, Davis, Duff 1995).
461/ 627
![Page 493: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/493.jpg)
Execution times (secs) on a SUN Sparc 10
Matrix              Order   Nonzeros       Minimum Degree
                                      V0       MMD      AMD
dwt 2680            2680    13853     35       0.2      0.2
  Min. memory size                    250KB    110KB    110KB
Wang4               26068   75552     -        11       5
Orani678            2529    85426     -        125      5

I Fill-in is similar
I Memory space for MMD and AMD: ≈ 2 x NZ integers
I V0 was not able to perform the reordering for the last two matrices (lack of memory after 2 hours of computation)
462/ 627
![Page 494: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/494.jpg)
Minimum fill-in heuristics

Recalling the generic algorithm:
Let G(A) be the graph associated with a matrix A that we want to order using local heuristics.
Let Metric be such that Metric(vi) < Metric(vj) ≡ vi is a better pivot candidate than vj.

Generic algorithm
Loop until all nodes are selected:
  Step 1: select current node p (the so-called pivot) with minimum metric value,
  Step 2: update the elimination (or quotient) graph,
  Step 3: update Metric(vj) for all non-selected nodes vj.

Step 3 should only be applied to nodes for which the Metric value might have changed.
463/ 627
![Page 495: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/495.jpg)
Minimum fill based algorithm
I Metric(vi) is the amount of fill-in that vi would introduce if it were selected as a pivot.
I Illustration: r has degree d = 4 and a fill-in metric of d x (d - 1)/2 = 6, whereas s has degree d = 5 but a fill-in metric of d x (d - 1)/2 - 9 = 1.

[Figure: graph with nodes r, s, i1..i5, j1..j4; the five neighbors of s are already almost fully interconnected, so eliminating s creates almost no fill]
464/ 627
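The exact metric (often called deficiency) can be sketched as follows (not from the course; the star graph below is a hypothetical example, unrelated to the slide's figure):

```python
# Sketch: the exact fill-in metric of a vertex is the number of missing
# edges among its neighbors; d*(d-1)/2 is only an upper bound.
def fill_metric(adj, v):
    nbrs = sorted(adj[v])
    d = len(nbrs)
    return sum(1 for a in range(d) for b in range(a + 1, d)
               if nbrs[b] not in adj[nbrs[a]])

# hypothetical example: the center of a star fills its whole neighborhood,
# while a leaf (degree 1) causes no fill at all
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(fill_metric(star, 0), fill_metric(star, 1))   # 3 0
```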
![Page 496: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/496.jpg)
Minimum fill-in properties
I The situation typically occurs when {i1, i2, i3} and {i2, i3, i4, i5} were adjacent to two already selected nodes (here e2 and e1).

[Figure: the graph around r and s before and after the selection of e1 and e2; e1 and e2 are previously selected nodes]
I The elimination of a node vk affects the degree of the nodes adjacent to vk. The fill-in metric of Adj(Adj(vk)) is also affected.
I Illustration: selecting r affects the fill-in metric of i1 (because of the fill edge (j3, j4)).
![Page 497: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/497.jpg)
How to compute the fill-in metrics
Computing the exact minimum fill-in metric is too costly:

I Only nodes adjacent to the current pivot are updated.
I Only approximate metrics (using clique structures) are computed:
  I Let dk be the degree of node k; dk x (dk - 1)/2 is an upper bound of the fill (for s: ds = 5, so ds x (ds - 1)/2 = 10).
  I Several possibilities:
    1. Deduce the clique area of the "last" selected pivot adjacent to k (for s: the clique of e2).
    2. Deduce the largest clique area of all adjacent selected pivots (for s: the clique of e1).
    3. If for dk we use AMD instead, then the cliques of all adjacent selected pivots can be deduced.
466/ 627
![Page 498: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/498.jpg)
Outline
Factorization of sparse matrices
  Introduction
  Elimination tree and Multifrontal approach
  Impact of fill reduction algorithm on the shape of the tree
  Postorderings and memory usage
  Equivalent orderings and elimination trees
  Comparison between 3 approaches for LU factorization
  Task mapping and scheduling
  Distributed memory approaches
  Some parallel solvers
  Case study: comparison of MUMPS and SuperLU
  Concluding remarks
467/ 627
![Page 499: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/499.jpg)
Factorization of sparse matrices
Outline
1. Introduction
2. Elimination tree and multifrontal method
3. Comparison between multifrontal, frontal and general approaches for LU factorization
4. Task mapping and scheduling
5. Distributed memory approaches: fan-in, fan-out, multifrontal
6. Some parallel solvers; case study on MUMPS and SuperLU.
7. Concluding remarks
468/ 627
![Page 500: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/500.jpg)
Factorization of sparse matrices
  Introduction
  Elimination tree and Multifrontal approach
  Impact of fill reduction algorithm on the shape of the tree
  Postorderings and memory usage
  Equivalent orderings and elimination trees
  Comparison between 3 approaches for LU factorization
  Task mapping and scheduling
  Distributed memory approaches
  Some parallel solvers
  Case study: comparison of MUMPS and SuperLU
  Concluding remarks
469/ 627
![Page 501: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/501.jpg)
Recalling the Gaussian elimination
Step k of LU factorization (akk pivot):
I For i > k compute lik = aik/akk (= a'ik),
I For i > k, j > k such that aik and akj are nonzero:

  a'ij = aij - (aik x akj)/akk

I If aik ≠ 0 and akj ≠ 0 then a'ij ≠ 0
I If aij was zero, its new non-zero value must be stored
I Orderings (minimum degree, Cuthill-McKee, ND) limit the fill-in and the number of operations, and modify the task graph
470/ 627
![Page 502: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/502.jpg)
Three-phase scheme to solve Ax = b
1. Analysis step
   I Preprocessing of A (symmetric/unsymmetric orderings, scalings)
   I Build the dependency graph (elimination tree, eDAG, ...)
2. Factorization (A = LU, LDL^T, LL^T, QR)
   Numerical pivoting
3. Solution based on factored matrices
   I Triangular solves: Ly = b, then Ux = y
   I Improvement of the solution (iterative refinement), error analysis
471/ 627
![Page 503: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/503.jpg)
Control of numerical stability: numerical pivoting
I In dense linear algebra, partial pivoting is commonly used (at each step the largest entry in the column is selected).
I In sparse linear algebra, flexibility to preserve sparsity is offered:
  I Partial threshold pivoting: eligible pivots are not too small with respect to the maximum in the column.
    Set of eligible pivots = { r | |a_rk^(k)| >= u x max_i |a_ik^(k)| }, where 0 < u <= 1.
  I Then, among the eligible pivots, select one that better preserves sparsity.
  I u is called the threshold parameter (u = 1 → partial pivoting).
  I It restricts the maximum possible growth of aij = aij - (aik x akj)/akk.
  I u ≈ 0.1 is often chosen in practice.
I Symmetric indefinite case: requires 2-by-2 pivots, e.g.

  | 0 1 |
  | 1 0 |
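The eligible set can be sketched in a few lines (a hypothetical illustration, not from the course: arbitrary column entries, 0-based row indices):

```python
# Sketch: eligible pivots of column k under partial threshold pivoting,
# { r : |a_rk| >= u * max_i |a_ik| }.
def eligible_pivots(col, u):
    m = max(abs(a) for a in col)
    return [r for r, a in enumerate(col) if abs(a) >= u * m]

col = [0.1, -4.0, 2.5, 0.5]        # entries a_rk of the current column
print(eligible_pivots(col, 1.0))   # [1]: partial pivoting keeps only the largest
print(eligible_pivots(col, 0.1))   # [1, 2, 3]: u = 0.1 leaves freedom for sparsity
```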
![Page 504: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/504.jpg)
Threshold pivoting and numerical accuracy
Table: Effect of variation in the threshold parameter u on a 541 x 541 matrix with 4285 nonzeros (Dongarra et al. 91).

u        Nonzeros in LU factors   Error
1.0      16767                    3 x 10^-9
0.25     14249                    6 x 10^-10
0.1      13660                    4 x 10^-9
0.01     15045                    1 x 10^-5
10^-4    16198                    1 x 10^2
10^-10   16553                    3 x 10^23
473/ 627
![Page 505: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/505.jpg)
Iterative refinement for linear systems
Suppose that a solver has computed A = LU (or LDL^T or LL^T), and a solution x to Ax = b.
1. Compute r = b− Ax.
2. Solve LU δx = r.
3. Update x = x + δx.
4. Repeat if necessary/useful.
474/ 627
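The loop above can be emulated with an artificially inexact solve standing in for the computed factors (a pure-Python sketch, not from the course; the 1% perturbation is arbitrary):

```python
# Sketch: iterative refinement with a deliberately inexact stand-in for
# the factored solve of a small diagonal system.
def solve_factored(b):
    # stand-in for "solve LU dx = r" with slightly wrong factors
    return [b[0] / 2.0 * 1.01, b[1] / 3.0 * 0.99]

A = [[2.0, 0.0], [0.0, 3.0]]
b = [2.0, 3.0]                               # exact solution x = [1, 1]
x = solve_factored(b)                        # initial (inexact) solution
for _ in range(20):
    r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
    dx = solve_factored(r)                   # step 2: solve with the factors
    x = [x[i] + dx[i] for i in range(2)]     # step 3: update
print(abs(x[0] - 1.0) < 1e-9, abs(x[1] - 1.0) < 1e-9)   # True True
```

Each pass shrinks the error by the relative inaccuracy of the solve (here a factor of about 100), which is why a few iterations usually suffice.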
![Page 506: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/506.jpg)
Factorization of sparse matrices
  Introduction
  Elimination tree and Multifrontal approach
  Impact of fill reduction algorithm on the shape of the tree
  Postorderings and memory usage
  Equivalent orderings and elimination trees
  Comparison between 3 approaches for LU factorization
  Task mapping and scheduling
  Distributed memory approaches
  Some parallel solvers
  Case study: comparison of MUMPS and SuperLU
  Concluding remarks
475/ 627
![Page 507: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/507.jpg)
Elimination tree and Multifrontal approach
We recall that:
I The elimination tree expresses dependencies between the various steps of the factorization.
I It also exhibits parallelism arising from the sparse structure of the matrix.
Building the elimination tree
I Permute the matrix (to reduce fill-in): PAP^T.
I Build the filled matrix AF = L + L^T, where PAP^T = LL^T.
I Perform the transitive reduction of the associated filled graph.

→ Each column corresponds to a node of the graph. Each node k of the tree corresponds to the factorization of a frontal matrix whose row structure is that of column k of AF.
476/ 627
![Page 508: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/508.jpg)
Illustration of multifrontal factorization
We assume pivots are chosen down the diagonal in order.

[Figure: the filled matrix (fill-in entries marked F in positions (3,4) and (4,3)) and the elimination graph: leaf nodes 1 (front {1,3,4}) and 2 (front {2,3,4}), their parent 3 (front {3,4}), and the root 4]

Treatment at each node:

I Assembly of the frontal matrix using the contributions from the sons.
I Gaussian elimination on the frontal matrix.
477/ 627
![Page 509: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/509.jpg)
I Elimination of variable 1 (a11 pivot)
I Assembly of the frontal matrix:

        1  3  4
    1 | x  x  x |
    3 | x        |
    4 | x        |

I Contributions aij = -(ai1 x a1j)/a11, i > 1, j > 1, on a33, a44, a34 and a43:

  a33^(1) = -(a31 x a13)/a11        a34^(1) = -(a31 x a14)/a11
  a43^(1) = -(a41 x a13)/a11        a44^(1) = -(a41 x a14)/a11

The terms -(ai1 x a1j)/a11 of the contribution matrix are stored for later updates.
478/ 627
![Page 510: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/510.jpg)
I Elimination of variable 2 (a22 pivot)
I Assembly of the frontal matrix: update of the elements of the pivot row and column using contributions from previous updates (none here)

        2  3  4
    2 | x  x  x |
    3 | x        |
    4 | x        |

I Contributions on a33, a34, a43, and a44:

  a33^(2) = -(a32 x a23)/a22        a34^(2) = -(a32 x a24)/a22
  a43^(2) = -(a42 x a23)/a22        a44^(2) = -(a42 x a24)/a22
479/ 627
![Page 511: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/511.jpg)
I Elimination of variable 3
I Assembly of the frontal matrix: update using the previous contributions:

  a'33 = a33 + a33^(1) + a33^(2)
  a'34 = a34 + a34^(1) + a34^(2)   (a34 = 0)
  a'43 = a43 + a43^(1) + a43^(2)   (a43 = 0)
  a'44 = a44^(1) + a44^(2)

The result of the elimination is stored as a so-called contribution matrix.
480/ 627
![Page 512: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/512.jpg)
Note that a44 is partially summed, since contributions are transferred only between son and father.
3 4
3 x x
4 x
I Contribution on a44: a44^(3) = a′44 − (a′43 × a′34)/a′33
I Elimination of variable 4
I Frontal matrix involves only a44: a44 = a44 + a44^(3)
481/ 627
![Page 513: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/513.jpg)
The multifrontal method (Duff, Reid’83)
[Figure: 5x5 example — graph of A, nonzero patterns of A and of L + U − I, with the fill-in entries (zeros of A that become nonzero in the factors) marked "Fill-in"]
Memory is divided into two parts (that can overlap in time):

I the factors

I the active memory

[Figure: memory layout — the factors, then the active memory, composed of a stack of contribution blocks and the active frontal matrix]
[Figure: example matrix with nodes 1-5, its factors and contribution blocks, and the elimination tree; the elimination tree represents task dependencies]
482/ 627
![Page 514: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/514.jpg)
Multifrontal method
From children to parent
I ASSEMBLY: Gather/Scatter operations (indirect addressing)
I ELIMINATION: Dense partial Gaussian elimination, Level 3 BLAS (TRSM, GEMM)
I CONTRIBUTION to parent
483/ 627
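The ASSEMBLY (extend-add) step above can be sketched as follows; the function and variable names are hypothetical, and NumPy fancy indexing stands in for the gather/scatter operations with indirect addressing:

```python
import numpy as np

def assemble(front_vars, original_block, children):
    """Extend-add: scatter each child's contribution block into the parent
    frontal matrix, using global variable lists for indirect addressing.
    `children` is a list of (vars, contribution_block) pairs."""
    pos = {v: i for i, v in enumerate(front_vars)}    # global -> local index
    F = original_block.copy()                         # entries of A for this front
    for child_vars, child_block in children:
        idx = np.array([pos[v] for v in child_vars])  # gather local positions
        F[np.ix_(idx, idx)] += child_block            # scatter-add contribution
    return F

# Tiny hypothetical example: parent front on variables (3, 4),
# one child contributing on the same variables.
F = assemble((3, 4),
             np.array([[1.0, 0.0], [0.0, 1.0]]),
             [((3, 4), np.array([[-0.5, 0.0], [0.0, -0.25]]))])
print(F)  # [[0.5, 0.], [0., 0.75]]
```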
![Page 520: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/520.jpg)
Supernodal methods
Definition
A supernode (or supervariable) is a set of contiguous columns in the factors L that share essentially the same sparsity structure.

I All algorithms (ordering, symbolic factorization, factorization, solve) generalize to blocked versions.
I Use of efficient matrix-matrix kernels (improve cache usage).
I Same concept as supervariables for elimination tree/minimum degree ordering.
I Supernodes and pivoting: pivoting inside a supernode does not increase fill-in.
484/ 627
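A minimal sketch of supernode detection, assuming each column is described by its strictly-below-diagonal structure; the merging criterion used here (column j's structure equals {j+1} plus column j+1's structure) is one common formalization of "essentially the same sparsity structure":

```python
def supernodes(col_struct):
    """Partition columns 0..n-1 into supernodes.
    col_struct[j] = set of row indices i > j with L[i, j] != 0
    (strictly-below-diagonal structure of column j of the factor L).
    Columns j and j+1 are merged when
    col_struct[j] == {j+1} | col_struct[j+1]."""
    n = len(col_struct)
    groups, current = [], [0]
    for j in range(n - 1):
        if col_struct[j] == {j + 1} | col_struct[j + 1]:
            current.append(j + 1)      # same structure: extend the supernode
        else:
            groups.append(current)     # structure changes: start a new one
            current = [j + 1]
    groups.append(current)
    return groups

# Hypothetical 5x5 factor: columns 0-1 form one supernode, 2-4 another.
struct = [{1, 3, 4}, {3, 4}, {3, 4}, {4}, set()]
print(supernodes(struct))  # [[0, 1], [2, 3, 4]]
```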
![Page 521: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/521.jpg)
Amalgamation
I Goal
I Exploit a more regular structure in the original matrix
I Decrease the amount of indirect addressing
I Increase the size of frontal matrices

I How?
I Relax the number of nonzeros of the matrix
I Amalgamation of nodes of the elimination tree
485/ 627
![Page 522: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/522.jpg)
I Consequences?
I Increase in the total amount of flops
I But decrease of indirect addressing
I And increase in performance

I Remark: if node i has structure {i, i1, i2, ..., if} and is a son of node j with structure {j, j1, j2, ..., jp}, and if {j, j1, j2, ..., jp} ⊂ {i1, i2, ..., if}, then the amalgamation of i and j is without fill-in.

Amalgamation of supernodes (same lower diagonal structure) is without fill-in.
486/ 627
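The no-fill condition of the remark above can be sketched as a simple set inclusion test (node structures are given as tuples whose first element is the pivot variable; the example nodes are hypothetical):

```python
def no_fill_amalgamation(child_struct, parent_struct):
    """child_struct: frontal variables (i, i1, ..., if) of child node i.
    parent_struct: frontal variables (j, j1, ..., jp) of its parent j.
    Amalgamating child and parent creates no fill-in iff the whole
    parent structure is contained in the child's non-pivot part."""
    return set(parent_struct) <= set(child_struct[1:])

# Hypothetical nodes of an elimination tree:
print(no_fill_amalgamation((3, 5, 6), (5, 6)))   # True:  merge is fill-free
print(no_fill_amalgamation((1, 4), (4, 5, 6)))   # False: merge creates fill
```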
![Page 523: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/523.jpg)
Illustration of amalgamation
Original Matrix Elimination tree
[Figure: 6x6 original matrix with fill-in entries marked F (in columns 3 and 4), and its elimination tree with nodes {1,4}, {2,4,5}, {3,5,6} (leaves), {5,6}, {4,5,6}, and 6 (root)]

Structure of node i = frontal matrix, noted {i, i1, i2, ..., if}
487/ 627
![Page 524: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/524.jpg)
Illustration of amalgamation
Amalgamation
[Figure: two amalgamations of the previous tree — one WITHOUT fill-in, keeping nodes {3,5,6} and {4,5,6} above children {1,4} and {2,4,5}; and one WITH fill-in, merging into a node {3,4,5,6}, which creates fill-in in entries (3,4) and (4,3)]
488/ 627
![Page 525: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/525.jpg)
Amalgamation and Supervariables
Amalgamation of supervariables does not cause fill-in.

Initial graph:

[Figure: graph with nodes 1-13]

Reordering: 1, 3, 4, 2, 6, 8, 10, 11, 5, 7, 9, 12, 13
Supervariables: {1, 3, 4}; {2, 6, 8}; {10, 11}; {5, 7, 9, 12, 13}
489/ 627
![Page 526: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/526.jpg)
Supervariables and multifrontal method
AT EACH NODE
F22 ← F22 − F12^T F11^{−1} F12

Pivots can ONLY be chosen from the F11 block, since F22 is NOT fully summed.
490/ 627
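A NumPy sketch of this partial factorization, with hypothetical F11, F12, F22 blocks; only the fully-summed F11 block is eliminated, and the Schur complement updates F22, which remains a contribution block for the parent:

```python
import numpy as np

# Hypothetical symmetric frontal matrix: 2 fully-summed variables (F11),
# 2 non-fully-summed variables (F22).
F11 = np.array([[4.0, 1.0], [1.0, 3.0]])
F12 = np.array([[2.0, 0.0], [0.0, 1.0]])
F22 = np.array([[5.0, 1.0], [1.0, 2.0]])

# Partial factorization: pivots are eliminated only inside F11;
# the Schur complement F22 - F12^T F11^{-1} F12 updates F22.
schur = F22 - F12.T @ np.linalg.solve(F11, F12)

# This equals the trailing block obtained by running ordinary Gaussian
# elimination on the assembled 4x4 front and stopping after 2 pivots.
print(schur)
```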
![Page 527: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/527.jpg)
Parallelization: two levels of parallelism
I Arising from sparsity: between nodes of the elimination tree (first level of parallelism)
I Within each node: parallel dense LU factorization (BLAS) (second level of parallelism)
[Figure: elimination tree with LU fronts; going up the tree, node parallelism increases while tree parallelism decreases]
491/ 627
![Page 528: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/528.jpg)
Exploiting the second level of parallelism is crucial
Multifrontal factorization

Computer        #procs   (1) MFlops (speed-up)   (2) MFlops (speed-up)
Alliant FX/80      8          15 (1.9)                34 (4.3)
IBM 3090J/6VF      6         126 (2.1)               227 (3.8)
CRAY-2             4         316 (1.8)               404 (2.3)
CRAY Y-MP          6         529 (2.3)              1119 (4.8)

Performance summary of the multifrontal factorization on matrix BCSSTK15. In column (1), we exploit only parallelism from the tree. In column (2), we combine the two levels of parallelism.
492/ 627
![Page 529: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/529.jpg)
Other features
I Dynamic management of parallelism:
I Pool of tasks for exploiting the two levels of parallelism
I Assembly operations also parallel (but indirect addressing)

I Dynamic management of data
I Storage of LU factors, frontal and contribution matrices
I Amount of memory available may conflict with exploiting maximum parallelism
493/ 627
![Page 530: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/530.jpg)
Factorization of sparse matrices
Introduction
Elimination tree and Multifrontal approach
Impact of fill reduction algorithm on the shape of the tree
Postorderings and memory usage
Equivalent orderings and elimination trees
Comparison between 3 approaches for LU factorization
Task mapping and scheduling
Distributed memory approaches
Some parallel solvers
Case study: comparison of MUMPS and SuperLU
Concluding remarks
494/ 627
![Page 531: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/531.jpg)
Impact of fill reduction on the shape of the tree
Reordering technique — Shape of the tree, observations

AMD: deep, well-balanced; large frontal matrices on top
AMF: very deep, unbalanced; small frontal matrices
495/ 627
![Page 532: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/532.jpg)
Reordering technique — Shape of the tree, observations

PORD: deep, unbalanced; small frontal matrices
SCOTCH: very wide, well-balanced; large frontal matrices
METIS: wide, well-balanced; smaller frontal matrices (than SCOTCH)
![Page 533: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/533.jpg)
Importance of the shape of the tree
Suppose that each node in the tree corresponds to a task that:
- consumes temporary data from the children,
- produces temporary data, that is passed to the parent node.

I Wide tree
I Good parallelism
I Many temporary blocks to store
I Large memory usage

I Deep tree
I Less parallelism
I Smaller memory usage
497/ 627
![Page 534: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/534.jpg)
Impact of fill-reducing heuristics
Size of factors (millions of entries)

           METIS   SCOTCH    PORD     AMF     AMD
gupta2      8.55    12.97    9.77    7.96    8.08
ship_003   73.34    79.80   73.57   68.52   91.42
twotone    25.04    25.64   28.38   22.65   22.12
wang3       7.65     9.74    7.99    8.90   11.48
xenon2     94.93   100.87  107.20  144.32  159.74

Peak of active memory (millions of entries)

           METIS   SCOTCH    PORD     AMF     AMD
gupta2     58.33   289.67   78.13   33.61   52.09
ship_003   25.09    23.06   20.86   20.77   32.02
twotone    13.24    13.54   11.80   11.63   17.59
wang3       3.28     3.84    2.75    3.62    6.14
xenon2     14.89    15.21   13.14   23.82   37.82
498/ 627
![Page 535: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/535.jpg)
Impact of fill-reducing heuristics
Number of operations (millions)

            METIS     SCOTCH      PORD       AMF       AMD
gupta2     2757.8     4510.7    4993.3    2790.3    2663.9
ship_003  83828.2    92614.0  112519.6   96445.2  155725.5
twotone   29120.3    27764.7   37167.4   29847.5   29552.9
wang3      4313.1     5801.7    5009.9    6318.0   10492.2
xenon2    99273.1   112213.4  126349.7  237451.3  298363.5

Matrix coneshl (SAMTECH, 1 million equations)

Matrix    Ordering   Factor entries   Total memory required   Floating-point operations
coneshl   METIS      687 ×10^6        8.9 GBytes              1.6 ×10^12
          PORD       746 ×10^6        8.4 GBytes              2.2 ×10^12
499/ 627
![Page 536: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/536.jpg)
Impact of fill-reducing heuristics/MUMPS
Time for factorization (seconds)
                    1p    16p   32p   64p  128p
coneshl  METIS     970     60    41    27    14
         PORD     1264    104    67    41    26
audi     METIS    2640    198   108    70    42
         PORD     1599    186   146    83    54

Matrices with quasi-dense rows: impact on the ordering time (seconds) for the gupta2 matrix

           AMD   METIS   QAMD
Analysis   361      52     23
Total      379      76     59
500/ 627
![Page 537: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/537.jpg)
![Page 538: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/538.jpg)
Trees, topological orderings and postorderings
I A rooted tree is a tree for which one node has been selected to be the root.

I A topological ordering of a rooted tree is an ordering that numbers children nodes before their parent.

I Postorderings are topological orderings which number nodes in any subtree consecutively.
502/ 627
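The two definitions can be checked programmatically; a sketch in Python (the tree encoding and the example tree are hypothetical):

```python
def is_topological(parent, order):
    """order[v] = position of node v; parent[v] = parent of v (None for root).
    Topological: every node is numbered before its parent."""
    return all(p is None or order[v] < order[p] for v, p in parent.items())

def _in_subtree(u, v, parent):
    """True iff u belongs to the subtree rooted at v."""
    while u is not None:
        if u == v:
            return True
        u = parent[u]
    return False

def is_postorder(parent, order):
    """Postordering: topological, and every subtree is numbered consecutively."""
    if not is_topological(parent, order):
        return False
    size = {v: 1 for v in parent}            # subtree sizes, children first
    for v in sorted(parent, key=order.get):
        if parent[v] is not None:
            size[parent[v]] += size[v]
    # nodes of v's subtree must occupy positions order[v]-size[v]+1 .. order[v]
    return all(
        all(order[v] - size[v] < order[u] <= order[v]
            for u in parent if _in_subtree(u, v, parent))
        for v in parent)

# Hypothetical tree: root r with children a and b; a has children c and d.
par = {"c": "a", "d": "a", "a": "r", "b": "r", "r": None}
post = {"c": 1, "d": 2, "a": 3, "b": 4, "r": 5}   # a postordering
topo = {"c": 1, "b": 2, "d": 3, "a": 4, "r": 5}   # topological, not postorder
print(is_postorder(par, post), is_postorder(par, topo))  # True False
```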
![Page 539: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/539.jpg)
Trees, topological orderings and postorderings
[Figure: a connected graph G on nodes u, v, w, x, y, z; a rooted spanning tree with a topological ordering; and the same rooted spanning tree with a postordering]
502/ 627
![Page 540: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/540.jpg)
Postorderings and memory usage
I Assumptions:
I Tree processed from the leaves to the root
I Parents processed as soon as all children have completed (postorder of the tree)
I Each node produces and sends temporary data consumed by its father.

I Exercise: in which sense is a postordering-based tree traversal more interesting than a random topological ordering?

I Furthermore, memory usage also depends on the postordering chosen:

[Figure: a tree with leaves a-h and root i; the traversal (abcdefghi) is the best postordering for memory, (hfdbacegi) the worst]
503/ 627
![Page 542: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/542.jpg)
Example 1: Processing a wide tree
[Tree: leaves 1, 2, 3 and 6; internal nodes 4 and 5; root 7]
![Page 543: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/543.jpg)
[Figure: snapshots of the memory (unused, factor, stack, and non-free memory space) while the wide tree is processed node by node; the active memory holds the stack of contribution blocks and the current frontal matrix]
![Page 544: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/544.jpg)
Example 2: Processing a deep tree
[Tree: leaves 1 and 2, node 3, root 4]

[Figure: evolution of the memory (unused, factor, stack, and non-free memory space) while the deep tree is processed; for node 3 the steps shown are: allocation of 3, assembly step for 3, factorization step for 3, stack step for 3]
![Page 545: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/545.jpg)
Modelization of the problem
I Mi : memory peak for complete subtree rooted at i ,
I tempi : temporary memory produced by node i ,
I mparent : memory for storing the parent.
[Figure: parent node with children subtrees of peaks M1, M2, M3 and contribution blocks temp1, temp2, temp3]

Mparent = max( max_{j=1..nbchildren} ( Mj + sum_{k=1..j−1} tempk ),  mparent + sum_{j=1..nbchildren} tempj )    (11)
Objective: order the children to minimize Mparent
507/ 627
![Page 547: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/547.jpg)
Memory-minimizing schedules
Theorem.
[Liu,86] The minimum of max_j ( xj + sum_{i=1..j−1} yi ) is obtained when the sequence (xi, yi) is sorted in decreasing order of xi − yi.
Corollary
An optimal child sequence is obtained by rearranging the children nodes in decreasing order of Mi − tempi.

Interpretation: at each level of the tree, a child with a relatively large peak of memory in its subtree (Mi large with respect to tempi) should be processed first.

⇒ Apply on the complete tree, starting from the leaves (or from the root with a recursive approach)
![Page 548: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/548.jpg)
Optimal tree reordering
Objective: Minimize peak of stack memory
Tree_Reorder(T):
Begin
  for all i in the set of root nodes do
    Process_Node(i)
  end for
End

Process_Node(i):
  if i is a leaf then
    Mi = mi
  else
    for j = 1 to nbchildren do
      Process_Node(j-th child)
    end for
    Reorder the children of i in decreasing order of (Mj − tempj)
    Compute Mparent at node i using Formula (11)
  end if
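A Python sketch of Tree_Reorder/Process_Node, combining the corollary (sort children by decreasing Mj − tempj) with Formula (11); the tree encoding and the example values are hypothetical:

```python
def process_node(tree, m, temp, i):
    """Return the stack-memory peak M_i of the subtree rooted at i after
    reordering its children optimally (Liu '86): children are processed in
    decreasing order of M_j - temp_j, and M_i follows Formula (11).
    tree[i]: children of i; m[i]: memory of node i's frontal matrix;
    temp[i]: contribution block node i passes to its parent."""
    children = tree.get(i, [])
    if not children:
        return m[i]
    peaks = {j: process_node(tree, m, temp, j) for j in children}
    children.sort(key=lambda j: peaks[j] - temp[j], reverse=True)
    # Formula (11): peak while processing the j-th child (the earlier
    # contribution blocks are stacked), or while assembling the parent.
    stacked, peak = 0, 0
    for j in children:
        peak = max(peak, stacked + peaks[j])
        stacked += temp[j]
    return max(peak, m[i] + stacked)

# Hypothetical tree: node 3 with children 1 and 2.
tree = {3: [1, 2]}
m = {1: 8, 2: 5, 3: 4}
temp = {1: 2, 2: 4}
print(process_node(tree, m, temp, 3))  # 10 (processing child 2 first gives 12)
```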
![Page 549: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/549.jpg)
![Page 550: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/550.jpg)
Equivalent orderings of symmetric matrices
Let F be the filled matrix of a symmetric matrix A (that is, F = L + L^T, where A = LL^T).
G+(A) = G(F) is the associated filled graph.
Definition
[Equivalent orderings] P and Q are said to be equivalent orderings iff G+(PAP^T) = G+(QAQ^T).
By extension, a permutation P is said to be an equivalent ordering of a matrix A iff G+(PAP^T) = G+(A).
It can be shown that an equivalent reordering also preserves the amount of arithmetic operations for sparse Cholesky factorization.
511/ 627
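Equivalence of two orderings can be tested by comparing the filled graphs they generate; a sketch of symbolic elimination in Python (the 4-cycle example is hypothetical):

```python
def filled_graph(adj, order):
    """Symbolic elimination: return the set of undirected edges of the
    filled graph G+ obtained by eliminating the vertices of `adj`
    (dict: vertex -> set of neighbours) in the sequence `order`."""
    adj = {v: set(nb) for v, nb in adj.items()}        # work on a copy
    filled = {frozenset((u, v)) for u in adj for v in adj[u]}
    eliminated = set()
    for v in order:
        nbrs = adj[v] - eliminated
        for u in nbrs:                 # connect the remaining neighbours
            for w in nbrs:             # pairwise (this is the fill-in)
                if u != w:
                    adj[u].add(w)
                    filled.add(frozenset((u, w)))
        eliminated.add(v)
    return filled

# Hypothetical 4-cycle a-b-c-d: the orderings (a, c, b, d) and (c, a, d, b)
# produce the same filled graph, hence are equivalent orderings.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(filled_graph(cycle, "acbd") == filled_graph(cycle, "cadb"))  # True
```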
![Page 551: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/551.jpg)
Relation with elimination trees
I Let A be a reordered matrix, and G+(A) be its filled graph
I In the elimination tree, any tree traversal (that processes children before parents) corresponds to an equivalent ordering P of A, and the elimination tree of PAP^T is identical to that of A.
[Figure: graph G+(A) on nodes u, v, w, x, y, z, the matrix A with fill-in entries marked F, and the elimination tree of A with a topological ordering]
512/ 627
![Page 552: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/552.jpg)
Tree rotations
Definition
An ordering that does not introduce any fill is referred to as a perfect ordering.
Natural ordering is a perfect ordering of the filled matrix F.
Theorem.
For any node x of G+(A) = G(F), there exists a perfect ordering on G(F) such that x is numbered last.
I Essence of tree rotations:
I Nodes in the clique of x in F are numbered last
I Relative ordering of other nodes is preserved.
513/ 627
![Page 553: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/553.jpg)
Example of equivalent orderings
On the right-hand side, a tree rotation is applied on w (the clique of w is {w, x}; for the other nodes, the relative ordering w.r.t. the tree on the left is preserved).
[Figure: filled matrix F and two equivalent trees — on the left, the elimination tree on u, v, w, x, y, z with a postordering; on the right, the tree after the rotation on w, with w and x numbered last]
Remark: tree rotations can help reduce the temporary memory usage!
![Page 555: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/555.jpg)
![Page 556: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/556.jpg)
Comparison between 3 approaches for LU factorization
We compare three general approaches for sparse LU factorization
I General technique
I Frontal method
I Multifrontal approach
Distributed memory multifrontal and supernodal approaches will be compared in another section.
516/ 627
![Page 557: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/557.jpg)
Description of the 3 approaches
I General technique
I Numerical and sparsity pivoting performed at the same time
I Dynamic sparse data structures
I Good preservation of sparsity: local decisions influenced by numerical choices.

I Frontal method
I Extension of band or variable-band schemes
I No indirect addressing is required in the innermost loop (data are stored in dense matrices)
I Simple data structure, fast methods, easier to implement.
I Very popular, easy out-of-core implementation.
I Sequential by nature
517/ 627
![Page 558: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/558.jpg)
Description of the 3 approaches
I Multifrontal approach
I Can be seen as an extension of the frontal method
I Analysis phase to compute an ordering
I The ordering can then be perturbed by numerical pivoting
I Dense matrices are used in the innermost loops.

Compared to frontal schemes:
- more complex to implement (assembly of dense matrices, management of numerical pivoting)
- preserves the sparse structure in a better way.
518/ 627
![Page 559: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/559.jpg)
Frontal method
[Figure: frontal method on a band matrix A.
Step 1 (elimination of a11): the frontal matrix holds the values needed for the elimination of a11; update values fij = (ai1 × a1j)/a11 are generated.
Step j (elimination of ajj, j = 2, ...): the frontal matrix holds the updates from the previous steps, plus row j and column j of the original matrix.]
I Properties: the band is treated as full → efficient reordering to minimize bandwidth is crucial.
519/ 627
![Page 560: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/560.jpg)
Band reduction: Illustration
[Figure: sparsity patterns of the original matrix (nz = 5104) and of its factors (nz = 58202)]

Figure: Matrix dwt 592.rua (N=512, NZ=2007); structural analysis of a submarine.
520/ 627
![Page 561: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/561.jpg)
Reordering: Reverse Cuthill-McKee
[Figure: sparsity patterns of the matrix reordered with Reverse Cuthill-McKee (nz = 5104) and of its factors after reordering (nz = 16924)]
521/ 627
![Page 562: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/562.jpg)
Frontal vs Multifrontal methods
[Figure: Frontal method — at step 2 the single frontal matrix covers the band, which is considered as full. Multifrontal method — considering frontal matrices is sufficient; several fronts move ahead simultaneously (the treatment of blocks 1 and 2 is independent).]
522/ 627
![Page 563: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/563.jpg)
Characteristics of multifrontal method:
I More complex data structures.
I Usually more efficient for preserving sparsity than frontal techniques
I Parallelism arising from sparsity.
523/ 627
![Page 564: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/564.jpg)
Illustration: comparison between 3 software packages for LU
I General approach (MA38: Davis and Duff):
I Control of fill-in: Markowitz criterion
I Numerical stability: partial pivoting
I Numerical and sparsity pivoting are performed in one step

I Multifrontal method (MA41: Amestoy and Duff):
I Reordering before numerical factorization
I Minimum degree type of reordering
I Partial pivoting for numerical stability
524/ 627
![Page 565: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/565.jpg)
I Frontal method (MA42: Duff and Scott):
I Reordering before numerical factorization
I Reordering for decreasing bandwidth
I Partial pivoting for numerical stability

I All these codes (MA38, MA41, MA42) are available in the HSL Library.
525/ 627
![Page 566: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/566.jpg)
Test problems from the Harwell-Boeing and Tim Davis (Univ. Florida) collections.

Matrix         Order   Nonzeros   Description
Orani678.rua    2526      90158   Economic modelling
Onetone1.rua   36057     341088   Harmonic balance method
Garon2.rua     13535     390607   2D Navier-Stokes
Wang3.rua      26064     177168   3D simulation of semiconductor
mhda416.rua      416       8562   Spectral problem in hydrodynamics
rim.rua        22560    1014951   CFD nonlinear problem
526/ 627
![Page 567: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/567.jpg)
Execution times on a SUN
Matrix (Method)      Flops   Size of factors   Time
                             (×10^6 words)     (seconds)

Onetone1.rua (Flops ×10^9)
  General                2         5               59
  Frontal               19       115             6392
  Multifrontal           8        10              193

Garon2.rua (Flops ×10^8)
  General               40         8               95
  Frontal               20         9               86
  Multifrontal           4         2                8

mhda416.rua (Flops ×10^5)
  General               24      0.16             0.80
  Frontal                3      0.02             0.07
  Multifrontal          16      0.04             0.11
527/ 627
![Page 568: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/568.jpg)
![Page 569: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/569.jpg)
Task mapping and scheduling
I Assign tasks to processors to achieve a goal: makespan minimization, memory minimization, ...

I Many approaches:
I static: build the schedule before the execution and follow it at run-time
I Advantage: very efficient, since it has a global view of the system
I Drawback: requires a very good modelization of the platform
I dynamic: take scheduling decisions dynamically at run-time
I Advantage: reactive to the evolution of the platform and easy to use on several platforms
I Drawback: decisions taken with local criteria (a decision which seems to be good at time t can have very bad consequences at time t + 1)
529/ 627
![Page 570: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/570.jpg)
Influence of scheduling on the makespan
Objective:
Assign processes/tasks to processors so that the completion time, also called the makespan, is minimized. (We may also say that we minimize the maximum total processing time on any processor.)
530/ 627
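As an illustration of makespan minimization on independent tasks, here is the classical Largest-Processing-Time list-scheduling heuristic (a standard example, not a specific algorithm from the course):

```python
import heapq

def lpt_makespan(durations, nprocs):
    """Largest-Processing-Time list scheduling: assign each task (longest
    first) to the currently least-loaded processor; return the makespan."""
    loads = [0.0] * nprocs
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        # pop the least-loaded processor, give it the task, push it back
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)

print(lpt_makespan([7, 5, 4, 3, 3, 2], 2))  # 12, e.g. {7, 5} and {4, 3, 3, 2}
```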
![Page 571: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/571.jpg)
Task scheduling on shared memory computers
The data can be shared between processors without any communication.

I Dynamic scheduling of the tasks (pool of "ready" tasks).
I Each processor selects a task (the order can influence the performance).
I Example of a "good" topological ordering (w.r.t. time):

[Figure: elimination tree with a topological ordering of its 18 nodes]

This ordering is not so good in terms of working memory.
531/ 627
![Page 572: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/572.jpg)
Static scheduling: Subtree to subcube (or proportional) mapping

Main objective: reduce the volume of communication between processors.

I Recursively partition the processors "equally" between the children of a given node.
I Initially all processors are assigned to the root node.
I Good at localizing communication, but not so easy if no overlapping between processor partitions is allowed at each step.
[Figure: elimination tree mapped onto 5 processors — the root gets {1,2,3,4,5}, its children {1,2,3} and {4,5}, and so on down to single processors at the leaves. Mapping of the tasks onto the 5 processors]
532/ 627
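A sketch of proportional mapping in Python; the splitting rule and the example tree are simplified assumptions, not the exact algorithm of any particular solver:

```python
def proportional_mapping(tree, work, procs, root, out=None):
    """Subtree-to-subcube style mapping: the processors assigned to a node
    are split among its children proportionally to the computational
    weight `work` of each child's subtree (each child gets at least one
    processor, so partitions may overlap when children outnumber procs)."""
    if out is None:
        out = {}
    out[root] = procs                   # all given procs work on this node
    children = tree.get(root, [])
    if children:
        total = sum(work[c] for c in children)
        start = 0.0
        for c in children:
            end = start + len(procs) * work[c] / total
            sub = procs[int(start):max(int(start) + 1, int(round(end)))]
            proportional_mapping(tree, work, sub, c, out)
            start = end
    return out

# Hypothetical tree: root 0 with children 1 (75% of the work) and 2 (25%).
tree = {0: [1, 2]}
work = {1: 75, 2: 25}
m = proportional_mapping(tree, work, list(range(4)), 0)
print(m[1], m[2])  # [0, 1, 2] [3]
```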
![Page 573: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/573.jpg)
Mapping of the tree onto the processors

Objective: find a layer L0 such that the subtrees of L0 can be mapped onto the processors with a good balance.

Construction and mapping of the initial level L0:

Begin
  Let L0 ← Roots of the assembly tree
  repeat
    Find the node q in L0 whose subtree has the largest computational cost
    Set L0 ← (L0 \ q) ∪ children of q
    Greedy mapping of the nodes of L0 onto the processors
    Estimate the load unbalance
  until load unbalance < threshold
End
[Figure: Steps A, B, C of the layer construction and mapping]
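The construction loop above can be sketched as follows (illustrative tree encoding; the greedy mapping assigns the most expensive subtrees first to the least-loaded processor, and the loop also stops if the most expensive node cannot be refined further):

```python
def build_initial_layer(tree, costs, roots, n_procs, threshold=0.2):
    """Sketch of the L0 construction: repeatedly replace the most
    expensive subtree root by its children, greedily map the current
    layer (largest subtree first, least-loaded processor), and stop
    once the estimated load imbalance drops below `threshold`."""
    def subtree_cost(n):
        return costs[n] + sum(subtree_cost(c) for c in tree.get(n, []))

    layer = list(roots)
    while True:
        loads = [0.0] * n_procs
        for node in sorted(layer, key=subtree_cost, reverse=True):
            loads[loads.index(min(loads))] += subtree_cost(node)
        imbalance = (max(loads) - min(loads)) / max(loads)
        if imbalance < threshold:
            return layer
        q = max(layer, key=subtree_cost)   # most expensive subtree
        if not tree.get(q):                # a leaf: cannot refine further
            return layer
        layer.remove(q)
        layer.extend(tree[q])
```

On a small unbalanced tree, the layer descends until the subtree costs can be spread evenly over the processors.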
![Page 574: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/574.jpg)
Decomposition of the tree into levels
I Determination of Level L0 based on subtree cost.
[Figure: tree decomposed into levels L0, L1, L2, L3 below the subtree roots]
I Mapping of the top of the tree can be dynamic.
I Could be useful for both shared and distributed memory algorithms.
534/ 627
![Page 575: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/575.jpg)
Factorization of sparse matrices
Introduction
Elimination tree and Multifrontal approach
Impact of fill reduction algorithm on the shape of the tree
Postorderings and memory usage
Equivalent orderings and elimination trees
Comparison between 3 approaches for LU factorization
Task mapping and scheduling
Distributed memory approaches
Some parallel solvers
Case study: comparison of MUMPS and SuperLU
Concluding remarks
535/ 627
![Page 576: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/576.jpg)
Distributed memory sparse solvers
536/ 627
![Page 577: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/577.jpg)
Computational strategies for parallel direct solvers
I The parallel algorithm is characterized by:
I Computational graph dependency
I Communication graph

I Three classical approaches:
1. “Fan-in”
2. “Fan-out”
3. “Multifrontal”
537/ 627
![Page 578: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/578.jpg)
Preamble: left- and right-looking approaches for Cholesky factorization

I cmod(j, k): modification of column j by column k, k < j
I cdiv(j): division of column j by a scalar

Left-looking approach
for j = 1 to n do
  for k ∈ Struct(row L_{j,1:j−1}) do
    cmod(j, k)
  cdiv(j)

Right-looking approach
for k = 1 to n do
  cdiv(k)
  for j ∈ Struct(col L_{k+1:n,k}) do
    cmod(j, k)
538/ 627
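In terms of these two primitives, both orderings can be sketched on a dense matrix (a minimal numpy sketch; the sparse versions merely restrict the k and j loops to the Struct sets):

```python
import numpy as np

def cdiv(L, j):
    """Divide column j by the square root of its diagonal entry."""
    L[j, j] = np.sqrt(L[j, j])
    L[j+1:, j] /= L[j, j]

def cmod(L, j, k):
    """Update column j with column k (k < j)."""
    L[j:, j] -= L[j, k] * L[j:, k]

def cholesky_left(A):
    """Left-looking: gather all cmod updates into column j, then cdiv."""
    L = np.tril(A).astype(float)
    n = A.shape[0]
    for j in range(n):
        for k in range(j):            # dense: all k < j
            cmod(L, j, k)
        cdiv(L, j)
    return L

def cholesky_right(A):
    """Right-looking: cdiv column k, then scatter its updates rightwards."""
    L = np.tril(A).astype(float)
    n = A.shape[0]
    for k in range(n):
        cdiv(L, k)
        for j in range(k + 1, n):     # dense: all j > k
            cmod(L, j, k)
    return L
```

Both orders perform exactly the same cmod/cdiv operations, only scheduled differently, so they produce the same factor L with L Lᵀ = A.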
![Page 579: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/579.jpg)
Illustration of left- and right-looking

[Figure: in left-looking, already computed columns are used to modify the current column; in right-looking, the current column updates the columns to its right]
539/ 627
![Page 580: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/580.jpg)
Assumptions and Notations

I Assumptions:
I We assume that each column of L / each node of the tree is assigned to a single processor.
I Each processor is in charge of computing cdiv(j) for the columns j that it owns.

I Notations:
I mycols(p) is the set of columns owned by processor p.
I map(j) gives the processor owning column j (or task j).
I procs(L∗k) = {map(j) | j ∈ Struct(L∗k)}
(only the processors in procs(L∗k) require updates from column k – they correspond to ancestors of k in the tree).
540/ 627
![Page 581: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/581.jpg)
Fan-in variant (similar to left-looking)

Demand-driven algorithm: the data required are aggregated update columns computed by the sending processor.

Fan-in (for processor p)
for j = 1 to n do
  u = 0
  for all k ∈ Struct(row L_{j,1:j−1}) ∩ mycols(p) do
    u = u + cmod(j, k)
  end for
  if map(j) ≠ p then
    Send u to processor map(j)
  end if
  if map(j) == p then
    Incorporate u in column j
    Receive all necessary aggregated updates on column j and incorporate them in column j
    cdiv(j)
  end if
end for
![Page 582: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/582.jpg)
Fan-in variant (similar to left-looking)

[Figure: left-looking update pattern — columns (1), (2), (3) are used to modify column (4)]

Algorithm (Cholesky):
For j = 1 to n do
  For k in Struct(L_{j,∗}) do
    cmod(j, k)
  Endfor
  cdiv(j)
Endfor

If map(1) = map(2) = map(3) = p and map(4) ≠ p, (only) one message is sent by p to update column 4 → exploits the data locality of proportional mapping.
![Page 583: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/583.jpg)
Fan-in variant

[Figure: animation — four subtrees, all mapped on P0, with the father node on P4; a single aggregated message flows from P0 to P4]

If ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, (only) one message is sent by P0 → exploits the data locality of proportional mapping.
542/ 627
![Page 590: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/590.jpg)
Fan-out variant (similar to right-looking)

Data-driven algorithm. Fan-out(p):
for all leaf nodes j ∈ mycols(p) do
  cdiv(j)
  send column L∗j to procs(col L∗j)
  mycols(p) = mycols(p) − {j}
end for
while mycols(p) ≠ ∅ do
  Receive any column (say L∗k) of L
  for j ∈ Struct(col L∗k) ∩ mycols(p) do
    cmod(j, k)
    if column j completely updated then
      cdiv(j)
      send column L∗j
      mycols(p) = mycols(p) \ {j}
    end if
  end for
end while
![Page 591: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/591.jpg)
Fan-out variant (similar to right-looking)

[Figure: right-looking update pattern — once column (1) is computed, it updates columns (2), (3), (4)]

Algorithm (Cholesky):
For k = 1 to n do
  cdiv(k)
  For j in Struct(L∗,k) do
    cmod(j, k)
  Endfor
Endfor

If map(2) = map(3) = p and map(4) ≠ p, then 2 messages (for columns 2 and 3) are sent by p to update column 4.
![Page 592: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/592.jpg)
Fan-out variant

[Figure: animation — four subtrees mapped on P0, father node on P4; one factor-column message per child flows from P0 to P4]

If ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, then n messages (where n is the number of children) are sent by P0 to update the processor in charge of the father.
544/ 627
![Page 597: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/597.jpg)
Fan-out variant
Properties of fan-out:

I Historically the first implemented.
I Incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of:
I total number of messages
I total volume
I Does not exploit the data locality of proportional mapping.
I Improved algorithm (local aggregation):
I send aggregated update columns instead of individual factor columns for columns mapped on a single processor;
I improves the exploitation of the data locality of proportional mapping;
I but the memory increase to store the aggregates can be critical (as in fan-in).
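The message-count difference between the two variants on the star example of the previous slides can be checked with a toy count (an illustrative sketch restricted to the case where each column updates only its parent column; `parent` and `owner` encode the example tree):

```python
def count_messages(parent, owner):
    """Toy message counts when each column k updates only its parent
    column.  Fan-out sends one factor column per child whose owner
    differs from the parent's owner; fan-in aggregates all updates a
    processor computes for the same target column into one message."""
    fan_out = sum(1 for k, p in parent.items() if owner[k] != owner[p])
    fan_in = len({(owner[k], p) for k, p in parent.items()
                  if owner[k] != owner[p]})
    return fan_out, fan_in

# Slide example: four children on P0, father column 5 on P4
parent = {1: 5, 2: 5, 3: 5, 4: 5}
owner = {1: 'P0', 2: 'P0', 3: 'P0', 4: 'P0', 5: 'P4'}
print(count_messages(parent, owner))  # (4, 1): n messages vs one aggregate
```

This reproduces the slide's claim: fan-out sends n messages from P0 where fan-in sends a single aggregated one.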
![Page 598: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/598.jpg)
Multifrontal variant
[Figure: elimination tree and right-looking factorization; at each node a dense frontal matrix is built, partially factored, and its contribution block (CB) is passed to the father]

“Multifrontal Method” — Algorithm:
For k = 1 to n do
  Build full frontal matrix with all indices in Struct(L∗,k)
  Partial factorisation
  Send Contribution Block to Father
Endfor
546/ 627
![Page 599: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/599.jpg)
Multifrontal variant
[Figure: three panels — (a) Fan-in, (b) Fan-out, (c) Multifrontal — showing, for tasks on P0, the update messages sent to P1 and P2 under each scheme]
Figure: Communication schemes for the three approaches.
547/ 627
![Page 604: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/604.jpg)
Factorization of sparse matrices
Introduction
Elimination tree and Multifrontal approach
Impact of fill reduction algorithm on the shape of the tree
Postorderings and memory usage
Equivalent orderings and elimination trees
Comparison between 3 approaches for LU factorization
Task mapping and scheduling
Distributed memory approaches
Some parallel solvers
Case study: comparison of MUMPS and SuperLU
Concluding remarks
548/ 627
![Page 605: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/605.jpg)
Some parallel solvers
549/ 627
![Page 606: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/606.jpg)
Shared memory sparse direct codes
Code       Technique           Scope     Availability (www.)
MA41       Multifrontal        UNS       cse.clrc.ac.uk/Activity/HSL
MA49       Multifrontal QR     RECT      cse.clrc.ac.uk/Activity/HSL
PanelLLT   Left-looking        SPD       Ng
PARDISO    Left-right looking  UNS       Schenk
PSL†       Left-looking        SPD/UNS   SGI product
SPOOLES    Fan-in              SYM/UNS   netlib.org/linalg/spooles
SuperLU    Left-looking        UNS       nersc.gov/∼xiaoye/SuperLU
WSMP‡      Multifrontal        SYM/UNS   IBM product

† Only object code for SGI is available
550/ 627
![Page 607: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/607.jpg)
Distributed-memory sparse direct codes
Code      Technique        Scope     Availability (www.)
CAPSS     Multifrontal LU  SPD       netlib.org/scalapack
MUMPS     Multifrontal     SYM/UNS   graal.ens-lyon.fr/MUMPS
PaStiX    Fan-in           SPD       see caption§
PSPASES   Multifrontal     SPD       cs.umn.edu/∼mjoshi/pspases
SPOOLES   Fan-in           SYM/UNS   netlib.org/linalg/spooles
SuperLU   Fan-out          UNS       nersc.gov/∼xiaoye/SuperLU
S+        Fan-out†         UNS       cs.ucsb.edu/research/S+
WSMP‡     Multifrontal     SYM       IBM product

§ dept-info.labri.u-bordeaux.fr/∼ramet/pastix
‡ Only object code for IBM is available. No numerical pivoting performed.
551/ 627
![Page 609: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/609.jpg)
Factorization of sparse matrices
Introduction
Elimination tree and Multifrontal approach
Impact of fill reduction algorithm on the shape of the tree
Postorderings and memory usage
Equivalent orderings and elimination trees
Comparison between 3 approaches for LU factorization
Task mapping and scheduling
Distributed memory approaches
Some parallel solvers
Case study: comparison of MUMPS and SuperLU
Concluding remarks
552/ 627
![Page 610: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/610.jpg)
MUMPS (Multifrontal sparse solver)
Amestoy, Duff, Guermouche, Koster, L’Excellent, Pralet

1. Analysis and Preprocessing
• Preprocessing (max. transversal, scaling)
• Fill-in reduction on A + AT
• Partial static mapping (elimination tree)

2. Factorization
• Multifrontal (elimination tree of A + AT), Struct(L) = Struct(U)
• Partial threshold pivoting
• Node parallelism
- Partitioning (1D Front - 2D Root)
- Dynamic distributed scheduling

3. Solution step and iterative refinement

Features: real/complex, symmetric/unsymmetric matrices; distributed input; assembled/elemental format; Schur complement; multiple sparse right-hand sides.
![Page 611: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/611.jpg)
SuperLU (Gaussian elimination with static pivoting)
X.S. Li and J.W. Demmel

1. Analysis and Preprocessing
• Preprocessing (max. transversal, scaling)
• Fill-in reduction on A + AT
• Static mapping on a 2D grid of processes

2. Factorization
• Fan-out (elimination DAGs)
• Static pivoting: if |aii| < √ε ‖A‖, set aii to √ε ‖A‖
• 2D irregular block cyclic partitioning (based on supernode structure)
• Pipelining / BLAS-3 based factorization

3. Solution step and iterative refinement

Features: parallel analysis; real and complex matrices; multiple right-hand sides.
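The static-pivoting rule of the factorization step can be sketched as follows (an illustration, not SuperLU code; the choice of the Frobenius norm for ‖A‖ here is an assumption):

```python
import numpy as np

def static_pivot_perturb(A):
    """Sketch of the static-pivoting safeguard: tiny diagonal entries
    are bumped up to sqrt(eps)*||A|| before factorization, trading a
    small controlled perturbation for a fixed, communication-friendly
    pivot order.  The error is then recovered by iterative refinement."""
    A = A.astype(float).copy()
    tau = np.sqrt(np.finfo(float).eps) * np.linalg.norm(A)  # Frobenius norm
    d = np.diag(A)
    A[np.diag_indices_from(A)] = np.where(np.abs(d) < tau, tau, d)
    return A
```

A factorization of the perturbed matrix never breaks down on a zero pivot, at the price of solving a slightly perturbed system.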
![Page 612: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/612.jpg)
MUMPS: dynamic scheduling

Graph of tasks = tree
Each task = partial factorization of a dense matrix
Some parallel tasks mapped at runtime (80 %)

[Figure: elimination tree laid out over time; subtrees at the bottom are mapped statically onto P0–P3, while 2D-decomposed tasks higher in the tree are scheduled dynamically — e.g. a 1D pipelined factorization mastered by P2, whose slaves P3 and P0 are chosen by P2 at runtime]
555/ 627
![Page 615: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/615.jpg)
Node level parallelism in multifrontal solver
MUMPS: pipelined factorization

[Figure: a master process (P2) factors pivot blocks of NPIV columns and broadcasts them (BLOCKFACT messages) to the slave processes (P3, P4), which apply TRSM + GEMM updates to their shares of the L and U parts of the front]
556/ 627
![Page 616: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/616.jpg)
SuperLU: 2D block cyclic layout and data structures

[Figure: the global matrix is partitioned into supernodal blocks distributed block-cyclically over a 2 × 3 process mesh (processes 0–5); a block column of L is stored as an index array (number of blocks, and for each block its block #, number of full rows, and row subscripts) plus an nzval array with its leading dimension (LDA)]
557/ 627
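The block-cyclic part of the layout boils down to a simple owner formula (a sketch assuming row-major process numbering, matching the 2 × 3 mesh 0 1 2 / 3 4 5 of the figure):

```python
def owner(ib, jb, pr, pc):
    """Owner of block (ib, jb) in a 2D block-cyclic layout over a
    pr x pc process grid, with processes numbered row-major."""
    return (ib % pr) * pc + (jb % pc)

# 2 x 3 process mesh: each block row repeats the pattern 0 1 2 or 3 4 5
grid = [[owner(i, j, 2, 3) for j in range(6)] for i in range(4)]
for row in grid:
    print(row)
```

Every process thus owns blocks scattered over the whole matrix, which balances the load as the factorization shrinks the active submatrix.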
![Page 617: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/617.jpg)
Trace of execution (bbmat, 8 proc. CRAY T3E)

[Figure: execution traces of the 8 processes around t ≈ 9 s, showing factorization events interleaved with MPI communication]
558/ 627
![Page 618: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/618.jpg)
Test problems

Real Unsymmetric Assembled
Matrix     Order    NZ        StrSym  Origin
bbmat      38744    1771722   54      R.-B. (CFD)
ecl32      51993    380415    93      EECS Dept. UC Berkeley
invextr1   30412    1793881   97      Parasol (Polyflow)
fidapm11   22294    623554    99      SPARSKIT2 (CFD)
garon2     13535    390607    100     Davis (CFD)
lhr71c     70304    1528092   0       Davis (Chem Eng)
lnsp3937   3937     25407     87      R.-B. (CFD)
mixtank    29957    1995041   100     Parasol (Polyflow)
rma10      46835    2374001   98      Davis (CFD)
twotone    120750   1224224   14      R.-B. (circuit sim)

Real Symmetric Assembled (rsa)
bmwcra_1   148770   5396386   100     Parasol (MSC.Software)
cranksg2   63838    7106348   100     Parasol (MSC.Software)
inline_1   503712   18660027  100     Parasol (MSC.Software)

StrSym: structural symmetry;
R.-B.: Rutherford-Boeing set.
559/ 627
![Page 619: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/619.jpg)
Impact of preprocessing and numerical issues

I Objective: maximize the diagonal entries of the permuted matrix
I MC64 (Harwell Subroutine Library) code from Duff and Koster (1999)
I Unsymmetric permutation (maximum weighted matching) and scaling
I The preprocessed matrix B = D1 A Q D2 is such that |bii| = 1 and |bij| ≤ 1
I Expectations:
I MUMPS: reduce the number of off-diagonal pivots and postponed variables (reduce numerical fill-in)
I SuperLU: reduce the number of modified diagonal entries
I Improve accuracy.
560/ 627
![Page 620: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/620.jpg)
MC64 and Flops (10⁹) for factorization (AMD ordering)

Matrix     MC64  StrSym  MUMPS      SuperLU
lhr71c     No    0       1431.0(∗)  –
           Yes   21      1.4        0.5
twotone    No    28      1221.1     159.0
           Yes   43      29.3       8.0
fidapm11   No    100     9.7        8.9
           Yes   29      28.5       22.0

(∗) Estimated during analysis.
– Not enough memory to run the factorization.
561/ 627
![Page 621: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/621.jpg)
Backward error analysis: Berr = max_i |r_i| / (|A| · |x| + |b|)_i

[Figure: Berr of MUMPS (left) and SuperLU (right), with and without MC64, on bbmat, ecl32, invextr1, fidapm11, garon2, lnsp3937, mixtank, rma10, twotone; values range from 10⁻¹⁶ to 10⁰]

One step of iterative refinement generally leads to Berr ≈ ε
Cost (1 step of iterative refinement) ≈ Cost (LUx = b − Ax)
562/ 627
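The componentwise backward error used in these plots can be computed directly from its definition (a small numpy sketch; the 2 × 2 system is illustrative):

```python
import numpy as np

def berr(A, x, b):
    """Componentwise backward error Berr = max_i |r_i| / (|A|.|x| + |b|)_i
    with residual r = b - A x."""
    r = b - A @ x
    return np.max(np.abs(r) / (np.abs(A) @ np.abs(x) + np.abs(b)))

A = np.array([[4., 1.], [1., 3.]])
b = np.array([1., 2.])
x = np.linalg.solve(A, b)
print(berr(A, x, b))   # of the order of machine epsilon
```

A solution from a stable solver gives Berr near ε; a perturbed solution gives a visibly larger value, which is what the bar charts compare.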
![Page 622: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/622.jpg)
Factorization cost study

I SuperLU preserves/exploits the sparsity/asymmetry better than MUMPS. This results in:
++ smaller size of factors (less memory)
++ fewer operations
++ more independence/parallelism
−− extra cost of taking the asymmetry into account
−− smaller block size for BLAS-3 kernels
563/ 627
![Page 623: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/623.jpg)
Cost of preserving sparsity (time on T3E, 4 procs)

[Figure: ratios SuperLU/MUMPS of factor size, flops, factorization time and solve time for bbmat(50), fidapm11(46), twotone(43), lhr71c(21)]
564/ 627
![Page 624: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/624.jpg)
Nested Dissection versus Minimum Degree orderings (time on T3E, 4 procs)

[Figure: left — flops (10⁹) with AMD and ND orderings, for MUMPS and SuperLU, on bbmat, ecl32, invextr1, mixtank; right — factorization time ratio AMD/ND for the same matrices]
565/ 627
![Page 625: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/625.jpg)
Communication issues

[Figure: left — average communication volume (MBytes) on 64 processors; right — average message size (KBytes) on 64 processors; MUMPS and SuperLU with AMD and ND orderings on bbmat, ecl32, invextr1, mixtank, twotone]
566/ 627
![Page 626: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/626.jpg)
Time ratios of the numerical phases: Time(SuperLU) / Time(MUMPS)

[Figure: ratios for the factorization (left) and solve (right) phases on 4 to 512 processors, for bbmat, ecl32, invextr1, mixtank, twotone]
567/ 627
![Page 627: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/627.jpg)
Time (in seconds) of the numerical phases

[Figure: factorization (left) and solve (right) times of MUMPS and SuperLU on 4 to 512 processors, for bbmat, ecl32, invextr1, mixtank, twotone]
568/ 627
![Page 628: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/628.jpg)
Performance analysis on 3-D grid problems
Rectangular grids - Nested Dissection ordering

[Figure: megaflop rate (left) and efficiency (right) on 1 to 128 processors, for MUMPS-SYM, MUMPS-UNS and SuperLU]
569/ 627
![Page 629: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/629.jpg)
Summary

I Sparsity and total memory
- SuperLU preserves sparsity better
- SuperLU uses (≈ 20%) less memory on 64 procs (asymmetry - fan-out/multifrontal)
I Communication
- Global volume is comparable
- MUMPS: much smaller (÷10) number of messages
I Factorization / solve time
- MUMPS is faster on nprocs ≤ 64
- SuperLU is more scalable
I Accuracy
- MUMPS provides a better initial solution
- SuperLU: one step of iterative refinement is often enough
570/ 627
![Page 630: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/630.jpg)
Factorization of sparse matrices
Introduction
Elimination tree and Multifrontal approach
Impact of fill reduction algorithm on the shape of the tree
Postorderings and memory usage
Equivalent orderings and elimination trees
Comparison between 3 approaches for LU factorization
Task mapping and scheduling
Distributed memory approaches
Some parallel solvers
Case study: comparison of MUMPS and SuperLU
Concluding remarks
571/ 627
![Page 631: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/631.jpg)
Concluding remarks

I Key parameters in selecting a method:
1. Functionalities of the solver
2. Characteristics of the matrix
I Numerical properties and pivoting
I Symmetric or general
I Pattern and density
3. Preprocessing of the matrix
I Scaling
I Reordering for minimizing fill-in
4. Target computer (architecture)

I Substantial gains can be achieved with an adequate solver, in terms of numerical precision, computing time and storage
I Good knowledge of the matrix and of the solvers is needed
I Many challenging problems
I Active research area
572/ 627
![Page 632: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/632.jpg)
Outline
Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning
573/ 627
![Page 633: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/633.jpg)
Iterative Methods
Principles:

I Generate a sequence of approximations x(k) to the solution
I Essentially involve matrix-vector products
I Often linked to preconditioning techniques: Ax = b → MAx = Mb
I Evaluation of a method: speed of convergence
574/ 627
![Page 634: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/634.jpg)
Direct method vs. iterative method

Direct
I Very general technique
I High numerical accuracy
I Handles sparse matrices with irregular patterns
I Factorization of A
I May be costly in terms of memory for the factors
I Factors can be reused for successive/multiple right-hand sides

Iterative
I Efficiency depends on the type of the problem
I Convergence depends on the preconditioning
I Numerical properties depend on the structure of A
I Requires the product of A by a vector
I Less costly in terms of memory, and possibly flops
I Solutions with successive right-hand sides can be problematic
575/ 627
![Page 635: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/635.jpg)
Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning
576/ 627
![Page 636: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/636.jpg)
Basic iterative methods (stationary methods)
Definition
An iterative method is called stationary if x^(k+1) can be expressed as a function of x^(k) only.
I Residual at iteration k: r^(k) = b − A x^(k)
I i-th component:
  r_i^(k) = b_i − Σ_j a_ij x_j^(k) = b_i − Σ_{j≠i} a_ij x_j^(k) − a_ii x_i^(k)
I Idea: try to “reset” the r_i components to 0. This gives:
  Do i = 1, n
    x_i^(k+1) = (1/a_ii) (b_i − Σ_{j≠i} a_ij x_j^(k))
  EndDo
Jacobi iteration
577/ 627
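As a concrete illustration, the Jacobi sweep above can be sketched in Python (the 3×3 strictly diagonally dominant system is invented for the example; its exact solution is [1, 1, 1]):

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: every component of the new iterate is computed
    from the previous iterate only."""
    n = len(A)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

# Small strictly diagonally dominant system, so Jacobi converges.
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = [0.0, 0.0, 0.0]
for _ in range(50):
    x = jacobi_step(A, b, x)
# x is now very close to the exact solution [1, 1, 1].
```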
![Page 638: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/638.jpg)
Gauss-Seidel method
I Jacobi:
  Do i = 1, n
    x_i^(k+1) = (1/a_ii) (b_i − Σ_{j=1..i−1} a_ij x_j^(k) − Σ_{j=i+1..n} a_ij x_j^(k))
  EndDo
I Remark that one does not use the latest information
I Gauss-Seidel iteration:
  Do i = 1, n
    x_i^(k+1) = (1/a_ii) (b_i − Σ_{j=1..i−1} a_ij x_j^(k+1) − Σ_{j=i+1..n} a_ij x_j^(k))
  EndDo
578/ 627
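The Gauss-Seidel sweep can be sketched in Python in the same spirit (same invented 3×3 system; note the in-place update, which is what lets x_j^(k+1) be reused immediately):

```python
def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep: x is updated in place, so each component
    immediately reuses the freshly computed components x_1 .. x_{i-1}."""
    n = len(A)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = [0.0, 0.0, 0.0]
for _ in range(30):
    gauss_seidel_step(A, b, x)
# Converges to [1, 1, 1], in fewer sweeps than Jacobi on this system.
```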
![Page 639: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/639.jpg)
Stationary Methods: Matrix Approach
Decompose A as
A = L + D + U
where D is the diagonal of A, and L (resp. U) is the strictly lower (resp. upper) triangular part.

I Given a non-singular matrix M, we define the recurrence:
  x^(k+1) = M^{-1}(b − (A − M) x^(k)) = x^(k) + M^{-1} r^(k)
  (Note: y = M^{-1} z means “Solve M y = z for y”)
I Jacobi iteration: M = D
  x^(k+1) = D^{-1}(b − (A − D) x^(k)) = x^(k) + D^{-1} r^(k)
I Gauss-Seidel iteration: M = L + D
  x^(k+1) = (D + L)^{-1}(b − (A − D − L) x^(k)) = x^(k) + (D + L)^{-1} r^(k)
579/ 627
![Page 640: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/640.jpg)
Variants
Successive over-relaxation (SOR):
x^(k+1) = ω x^(k+1)_GaussSeidel + (1 − ω) x^(k)
        = x^(k) + ω (D + ωL)^{-1} (b − A x^(k))
Choice of ω:
I Theoretical optimal values for limited classes of problems
I Problematic in general
Many other variants depending on the choice of the matrix M (block Jacobi, block Gauss-Seidel, ...)
580/ 627
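A minimal Python sketch of one SOR sweep (the 3×3 SPD system and the value of ω are invented for the illustration; ω = 1 recovers Gauss-Seidel and the chosen ω is not the optimal one):

```python
def sor_step(A, b, x, omega):
    """One SOR sweep: a weighted combination of the Gauss-Seidel update
    and the previous value of each component."""
    n = len(A)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (1.0 - omega) * x[i] + omega * (b[i] - s) / A[i][i]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = [0.0, 0.0, 0.0]
for _ in range(30):
    sor_step(A, b, x, 1.1)   # omega chosen arbitrarily, not optimized
# For this SPD matrix SOR converges for any 0 < omega < 2.
```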
![Page 641: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/641.jpg)
Convergence properties
Previous methods follow the model:
  x^(k+1) = x^(k) + M^{-1}(b − A x^(k))
Knowing that the solution x∗ satisfies A x∗ = b, this gives:
  x^(k+1) − x∗ = x^(k) − x∗ + M^{-1}(A x∗ − A x^(k))
Thus:
  x^(k+1) − x∗ = (I − M^{-1}A)(x^(k) − x∗)
Theorem.
The sequence (x^(k))_{k=1,2,...} defined by

  x^(k+1) = x^(k) + M^{-1}(b − A x^(k))

converges for all x^(0) to the solution x∗ iff the spectral radius of I − M^{-1}A satisfies ρ(I − M^{-1}A) < 1.
581/ 627
![Page 642: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/642.jpg)
Convergence properties
Proof.
x^(k) − x∗ = (I − M^{-1}A)^k (x^(0) − x∗)

⇒ Let (λ, v) be an eigenpair of G := I − M^{-1}A.
  G^k v = λ^k v, thus lim_{k→∞} G^k = 0 ⇒ |λ| < 1.

⇐ Based on the Jordan decomposition: there exists a nonsingular matrix V such that G = V^{-1} J V, where J is block diagonal and each Jordan block has λ_i on its diagonal and 1 on its superdiagonal.
Then G^k = (V^{-1} J V)^k = V^{-1} J^k V, and we can check that each diagonal block of J^k tends to 0 if |λ_i| < 1.
582/ 627
![Page 643: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/643.jpg)
Typical steps to design an iterative method:
I Propose a matrix M where linear systems of the form Mz = d are “easy” to solve
I Classes of matrices are identified for which the iteration matrix G = I − M^{-1}A satisfies ρ(G) < 1
I Find further results about ρ(G) to gain intuition on convergence speed
583/ 627
![Page 644: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/644.jpg)
Convergence properties
Theorem.
If A ∈ C^{n×n} is strictly diagonally dominant, then the Jacobi iterations converge.
Proof.
I − D^{-1}A = −D^{-1}(L + U)
ρ(I − D^{-1}A) ≤ ‖D^{-1}(L + U)‖_∞ = max_i Σ_{j≠i} |a_ij / a_ii| < 1
Theorem.
If A is symmetric positive definite, then the Gauss-Seidel iterations converge (for any x_0).
584/ 627
![Page 645: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/645.jpg)
Implementation of the Jacobi iteration
I Matrix-vector product using CSR format:
c$omp parallel do shared(ia,ja,val,x,y) private(i)
      do k = 1, n
         y(k) = 0.0d0
         do i = ia(k), ia(k+1)-1
            y(k) = y(k) + val(i) * x(ja(i))
         enddo
      enddo
I Jacobi iteration x^(k+1) ← x^(k) + D^{-1}(b − (L + U) x^(k)) can be vectorized and parallelized similarly to a matrix-vector product
585/ 627
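The same CSR matrix-vector product can be sketched in Python with 0-based indices (the tridiagonal example matrix is invented for the illustration):

```python
def csr_matvec(ia, ja, val, x):
    """y = A*x for A in CSR format: row k holds values val[ia[k]:ia[k+1]]
    in columns ja[ia[k]:ia[k+1]] (0-based indices)."""
    n = len(ia) - 1
    y = [0.0] * n
    for k in range(n):
        for p in range(ia[k], ia[k + 1]):
            y[k] += val[p] * x[ja[p]]
    return y

# The tridiagonal matrix [[4,1,0],[1,4,1],[0,1,4]] stored in CSR:
ia = [0, 2, 5, 7]
ja = [0, 1, 0, 1, 2, 1, 2]
val = [4.0, 1.0, 1.0, 4.0, 1.0, 1.0, 4.0]
y = csr_matvec(ia, ja, val, [1.0, 1.0, 1.0])   # -> [5.0, 6.0, 5.0]
```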
![Page 646: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/646.jpg)
Example of implementation of Gauss-Seidel
Consider the 5-point stencil (−4 at the center, 1 at the four neighbours) applied to the Poisson equation (∆g(x, y) = f(x, y), g = 0 on the boundary) on an N-by-M grid; we obtain an NM-by-NM block tridiagonal system Ag = f:

  A = | T    −I_N           |      T = |  4  −1          |
      | −I_N  T   ...       |          | −1   4  ...     |
      |      ...  ...  −I_N |          |     ...  ... −1 |
      |           −I_N  T   |          |          −1   4 |

  g = (G(1,1), ..., G(N,1), G(1,2), ..., G(N,2), ..., G(N,M))^T
  f = (f_11, ..., f_N1, f_12, ..., f_N2, ..., f_NM)^T
Which form does the Gauss-Seidel iteration take for this system ?
586/ 627
![Page 647: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/647.jpg)
With the convention G(i, j) = 0 for i = 0, i = N+1, j = 0 or j = M+1, the Gauss-Seidel iteration takes the form:

  DO j = 1, M
    DO i = 1, N
      G(i,j) = (f_ij + G(i−1,j) + G(i+1,j) + G(i,j−1) + G(i,j+1))/4
    ENDDO
  ENDDO

No storage required for the matrix in this case !

Parallel implementation for shared memory machines (M = N case), sweeping by antidiagonals:

  DO k = 1, N
C$OMP Parallel do, private(i,j)
    DO j = 1, k
      i = k − j + 1
      G(i,j) = (f_ij + G(i−1,j) + G(i+1,j) + G(i,j−1) + G(i,j+1))/4
    ENDDO
  ENDDO
  DO k = N+1, 2*N−1
  ...
587/ 627
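The sequential sweep can be sketched in Python, storing the grid with an extra ring of zeros for the boundary convention (the tiny 3×3 interior grid and the right-hand side f = 1 are invented for the example):

```python
def gs_sweep(G, f):
    """One Gauss-Seidel sweep for the 5-point stencil: G is an
    (N+2)-by-(M+2) array whose boundary ring stays at 0; only the
    interior points (1..N, 1..M) are updated, columns first."""
    N, M = len(G) - 2, len(G[0]) - 2
    for j in range(1, M + 1):
        for i in range(1, N + 1):
            G[i][j] = (f[i][j] + G[i - 1][j] + G[i + 1][j]
                       + G[i][j - 1] + G[i][j + 1]) / 4.0
    return G

# Tiny made-up example: 3x3 interior grid, f = 1 everywhere.
N = M = 3
f = [[1.0] * (M + 2) for _ in range(N + 2)]
G = [[0.0] * (M + 2) for _ in range(N + 2)]
for _ in range(100):
    gs_sweep(G, f)
# After enough sweeps, 4*G(i,j) - (sum of the 4 neighbours) = f(i,j).
```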
![Page 649: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/649.jpg)
Conclusion on stationary methods
I Relatively easy to implement and parallelize
I Often depend on parameters that are difficult to forecast (example: ω in SOR)
I Convergence difficult to guarantee in finite precision arithmetic
I Krylov methods are preferred
588/ 627
![Page 650: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/650.jpg)
Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning
589/ 627
![Page 651: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/651.jpg)
Krylov method : some background
Aleksei Nikolaevich Krylov (1863-1945), Russia, maritime engineer.
His research spans a wide range of topics, including shipbuilding, magnetism, artillery, mathematics, astronomy, and geodesy. In 1904 he built the first machine in Russia for integrating ODEs. In 1931 he published a paper on what is now called the “Krylov subspace”.
Definition
Let A ∈ R^{n×n} and r ∈ R^n; the space denoted by K(r, A, m) (with m ≤ n) and defined by

  K(r, A, m) = span{r, Ar, ..., A^{m−1} r}

is referred to as the Krylov space of dimension m associated with A and r.
590/ 627
![Page 652: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/652.jpg)
Why use this search space?
[?] For the sake of simplicity of exposition, we often assume x_0 = 0. This means no loss of generality, because the situation x_0 ≠ 0 can be transformed with a simple shift to the system Ay = b − A x_0, for which obviously y_0 = 0.
The minimal polynomial q(t) of A is the unique monic polynomial of minimal degree such that q(A) = 0. It is constructed from the eigenvalues of A as follows. If the distinct eigenvalues of A are λ_1, ..., λ_ℓ and if λ_j has index m_j (the size of the largest Jordan block associated with λ_j), then

  m = Σ_{j=1..ℓ} m_j,   and   q(t) = Π_{j=1..ℓ} (t − λ_j)^{m_j}.   (12)

When A is diagonalizable, m is the number of distinct eigenvalues of A; when A is a Jordan block of size n, then m = n.
![Page 653: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/653.jpg)
Example
A = V^{-1} | 3  1       | V
           |    3       |
           |       4    |
           |          4 |

I eigenvalue 3 of index 2
I eigenvalue 4 of index 1
I This gives m = 3 and q(t) = (t − 3)^2 (t − 4), whereas the characteristic polynomial is (t − 3)^2 (t − 4)^2.
I Property: q(A) = 0
592/ 627
![Page 654: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/654.jpg)
If we write

  q(t) = Π_{j=1..ℓ} (t − λ_j)^{m_j} = Σ_{j=0..m} α_j t^j,

then the constant term is α_0 = Π_{j=1..ℓ} (−λ_j)^{m_j}. Therefore α_0 ≠ 0 iff A is nonsingular. Furthermore, from

  0 = q(A) = α_0 I + α_1 A + ... + α_m A^m,   (13)

it follows that

  A^{-1} = −(1/α_0) Σ_{j=0..m−1} α_{j+1} A^j.

This description of A^{-1} portrays x = A^{-1} b immediately as a member of the Krylov space of dimension m associated with A and b, denoted by K(b, A, m) = span{b, Ab, ..., A^{m−1} b}.
![Page 655: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/655.jpg)
Taxonomy of the Krylov subspace approaches
The Krylov methods for identifying x_m ∈ K(b, A, m) fall into four classes:

I The Ritz-Galerkin approach (FOM, CG, ...): construct x_m such that the residual is orthogonal to the current subspace: b − A x_m ⊥ K(b, A, m).
I The minimum norm residual approach (GMRES, ...): construct x_m ∈ K(b, A, m) such that ||b − A x_m||_2 is minimal.
I The Petrov-Galerkin approach: construct x_m such that b − A x_m is orthogonal to some other m-dimensional subspace.
I The minimum norm error approach: construct x_m ∈ A^T K(b, A, m) such that the error norm ||x_m − x∗||_2 is minimal.
594/ 627
![Page 656: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/656.jpg)
Constructing a basis of K(b,A,m)
I Obvious choice: b, Ab, ..., A^{m−1} b
  I Not very attractive from the numerical point of view, because the vectors A^j b become more and more collinear to the eigenvector associated with the largest eigenvalue.
  I In finite arithmetic, this leads to a loss of rank: suppose A is diagonalizable, A = V D V^{-1}; then A^k b = V D^k (V^{-1} b).
I A better choice is the Arnoldi procedure.
595/ 627
![Page 657: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/657.jpg)
Arnoldi
Walter Edwin Arnoldi (1917-1995), USA.
His main research subjects covered vibration of propellers, engines and aircraft, high speed digital computers, aerodynamics and acoustics of aircraft propellers, lift support in space vehicles, and structural materials.
“The principle of minimized iterations in the solution of the eigenvalue problem”, Quart. of Appl. Math., Vol. 9, 1951.
596/ 627
![Page 658: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/658.jpg)
The Arnoldi procedure
This procedure builds an orthonormal basis of K(A, b,m).
Arnoldi’s algorithm
1: v_1 = b/‖b‖
2: for j = 1, 2, ..., m − 1 do
3:   Compute h_ij = v_i^T A v_j for i = 1, ..., j
4:   Compute w_j = A v_j − Σ_{i=1..j} h_ij v_i
5:   Compute h_{j+1,j} = ‖w_j‖
6:   Exit if h_{j+1,j} = 0
7:   Compute v_{j+1} = w_j / h_{j+1,j}
8: end for
597/ 627
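The procedure can be sketched in Python; this version uses modified Gram-Schmidt (mathematically equivalent to the classical orthogonalization written above, but numerically more stable), and the small symmetric matrix is invented for the illustration:

```python
import math

def arnoldi(matvec, b, m):
    """Build an orthonormal basis v_1, ..., v_m of K(b, A, m) plus the
    Hessenberg coefficients h[i, j], with modified Gram-Schmidt."""
    dot = lambda u, w: sum(ui * wi for ui, wi in zip(u, w))
    norm = lambda u: math.sqrt(dot(u, u))
    V = [[bi / norm(b) for bi in b]]          # v_1 = b / ||b||
    H = {}
    for j in range(m - 1):
        w = matvec(V[j])                      # w_j = A v_j
        for i in range(j + 1):                # orthogonalize against v_1..v_j
            H[i, j] = dot(V[i], w)
            w = [wk - H[i, j] * vk for wk, vk in zip(w, V[i])]
        H[j + 1, j] = norm(w)
        if H[j + 1, j] == 0.0:                # lucky breakdown
            break
        V.append([wk / H[j + 1, j] for wk in w])
    return V, H

# Small symmetric example matrix (invented for the illustration):
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 1.0, 0.0],
     [0.0, 1.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 1.0]]
mv = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
V, H = arnoldi(mv, [1.0, 0.0, 0.0, 1.0], 3)
# The vectors in V are orthonormal up to rounding.
```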
![Page 659: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/659.jpg)
The Arnoldi procedure properties
Proposition
If the Arnoldi procedure does not stop before the mth step, the vectors v_1, ..., v_m form an orthonormal basis of the Krylov subspace K(A, b, m).
Proof.
The vectors are orthonormal by construction. That they span K(A, b, m) follows from the fact that each vector v_j is of the form q_{j−1}(A) v_1, where q_{j−1} is a polynomial of degree j − 1. This can be shown by induction. For j = 1 it is true, as v_1 = q_0(A) v_1 with q_0 = 1. Assume that it is true up to j and consider v_{j+1}. We have:

  h_{j+1,j} v_{j+1} = A v_j − Σ_{i=1..j} h_ij v_i = A q_{j−1}(A) v_1 − Σ_{i=1..j} h_ij q_{i−1}(A) v_1

So v_{j+1} can be expressed as q_j(A) v_1 where q_j is of degree j.
![Page 660: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/660.jpg)
Conjugate Gradients
Conjugate Gradient Method
I Solve Ax = b, with A symmetric positive definite
I Belongs to the Ritz-Galerkin approaches (construct x_m ∈ K(b, A, m) such that b − A x_m ⊥ K(b, A, m))
I First introduced by Hestenes and Stiefel in 1952 [?]
Definition
Two non-zero vectors u and v are conjugate (with respect to A) if u^T A v = 0.
Because A is symmetric positive definite, the left-hand side defines an inner product: 〈u, v〉_A := 〈Au, v〉 = 〈u, Av〉 = u^T A v.
599/ 627
![Page 661: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/661.jpg)
Conjugate Gradients
Let (p_k)_{k=1..n} be a sequence of conjugate directions. They form a basis of R^n, so the solution of Ax = b can be written:

  x∗ = α_1 p_1 + · · · + α_n p_n.

Computing the α_k:

  A x∗ = α_1 A p_1 + · · · + α_n A p_n = b
  p_k^T A x∗ = α_1 p_k^T A p_1 + · · · + α_k p_k^T A p_k + · · · + α_n p_k^T A p_n = α_k p_k^T A p_k = p_k^T b

  α_k = p_k^T b / p_k^T A p_k = 〈p_k, b〉 / 〈p_k, p_k〉_A = 〈p_k, b〉 / ‖p_k‖_A^2.
Possible (direct) method to build a solution:
1. Build a set of n conjugate directions
2. Compute the coefficients αk (and x∗)
600/ 627
![Page 662: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/662.jpg)
Conjugate Gradient Method: main principles
Optimization point of view
Note that the solution x∗ is also the unique minimizer of the quadratic function f(x) = (1/2) x^T A x − b^T x, x ∈ R^n.
Steepest descent algorithm: search the successive x_k by moving from x_k to x_{k+1} in the direction −grad f(x_k) = −∇f(x_k) = b − A x_k = r_k.
So it makes sense to choose as first direction: p_0 = r_0 ∈ K(r_0, A, 1).
This gives x_1 = x_0 + α_0 p_0, where α_0 = r_0^T r_0 / p_0^T A p_0 was defined above.
Remark that r_1 = b − A x_1 = r_0 − α_0 A p_0 is orthogonal to p_0 (by construction).
We then choose p_1 ∈ K(r_0, A, 2) such that p_1^T A p_0 = 0.
More generally:
I r_{k+1} ⊥ span(r_0, ..., r_k) = span(p_0, ..., p_k) = K(r_0, A, k+1)
I Choose p_{k+1} of the form r_{k+1} + β_k p_k ∈ K(r_0, A, k+2)
  Conjugacy condition: p_{k+1} ⊥_A p_k ⇒ β_k = −(p_k^T A r_{k+1}) / (p_k^T A p_k)
601/ 627
![Page 663: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/663.jpg)
[Figure: the CG iterates x_0, x_1, x_2, x_3 with the search directions p_0, p_1, p_2 and residuals r_1, r_2]

  p_i^T A p_j = 0 for all i ≠ j
  r_k ⊥ p_i for i = 1 ... k − 1
602/ 627
![Page 664: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/664.jpg)
Geometric interpretation in 2D
Minimize f(x) = (1/2) x^T A x − b^T x
Steepest descent: orthogonal directions
Conjugate gradients: A-orthogonal (or conjugate) directions

[Figure: in 2D, steepest descent zig-zags through x_0, x_1, x_2, x_3, ..., while conjugate gradients reaches the minimizer in two steps x_0, x_1, x_2]
603/ 627
![Page 665: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/665.jpg)
The algorithm (after simplifications)
CG algorithm
1: r_0 = b − A x_0, p_0 = r_0
2: for j = 0, 1, ... do
3:   α_j = (r_j^T r_j)/(p_j^T A p_j)
4:   x_{j+1} = x_j + α_j p_j
5:   r_{j+1} = r_j − α_j A p_j
6:   if r_{j+1} “sufficiently” small then
7:     Exit the loop
8:   end if
9:   β_j = (r_{j+1}^T r_{j+1})/(r_j^T r_j)
10:  p_{j+1} = r_{j+1} + β_j p_j
11: end for
604/ 627
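The simplified algorithm can be sketched in Python; the matrix only enters through a matrix-vector product, and the 3×3 SPD system is invented for the example:

```python
import math

def cg(matvec, b, x0, maxiter=100, tol=1e-12):
    """Conjugate Gradient for a symmetric positive definite A,
    following the algorithm above (A only enters through matvec)."""
    dot = lambda u, w: sum(ui * wi for ui, wi in zip(u, w))
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(x))]   # r_0 = b - A x_0
    p = list(r)                                     # p_0 = r_0
    rr = dot(r, r)
    for _ in range(maxiter):
        if math.sqrt(rr) <= tol:                    # residual small enough
            break
        Ap = matvec(p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
mv = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
x = cg(mv, [5.0, 6.0, 5.0], [0.0, 0.0, 0.0])
# Exact solution is [1, 1, 1]; this matrix has 3 distinct eigenvalues,
# so CG converges in at most 3 iterations (up to rounding).
```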
![Page 666: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/666.jpg)
CG: Implementation and parallelization issues
Storage: four vectors (x, p, Ap, r)
Main kernels involved:
I one matrix-vector product per iteration (parallelizable)
I two dot products (latency + synchronization !!)
  I No issue on shared memory computers – BLAS 1 routine
  I Distributed memory computer: each processor computes its local contribution using its own components, followed by a reduction
  I Warning: in finite elements, one must decide who is responsible for the variables at the interface between 2 processors
605/ 627
![Page 667: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/667.jpg)
CG: Convergence properties
I Let x_k be the kth iterate generated by the CG algorithm and κ(A) the ratio λ_max/λ_min. Then

  ‖x_k − x∗‖_A ≤ 2 · ( (√κ(A) − 1) / (√κ(A) + 1) )^k · ‖x_0 − x∗‖_A.

I Much better if λ_max/λ_min is close to 1.
I If A is diagonalizable with m distinct eigenvalues, then CG converges in at most m steps (minimal polynomial).
I Furthermore, convergence is quicker if the eigenvalues are clustered.
I How to improve κ / better cluster the eigenvalues? Preconditioning
606/ 627
![Page 669: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/669.jpg)
Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning
607/ 627
![Page 670: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/670.jpg)
Driving principles to design preconditioners
Find a non-singular matrix M such that MA has “better” properties with respect to the convergence behaviour of the selected Krylov solver:

I MA has fewer distinct eigenvalues,
I MA ≈ I in some sense.
608/ 627
![Page 671: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/671.jpg)
The preconditioner constraints
The preconditioner should
I be cheap to compute and to store,
I be cheap to apply,
I ensure a fast convergence.
With a good preconditioner, the solution time for the preconditioned system should be significantly less than for the unpreconditioned system.
609/ 627
![Page 672: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/672.jpg)
The particular case of CG
For CG, let M be given in factorized form (i.e. M = C C^T); then CG can be applied to

  Ã x̃ = b̃,

with Ã = C^T A C, C x̃ = x and b̃ = C^T b.
Let us define:

  x_k = C x̃_k,
  p_k = C p̃_k,
  r̃_k = C^T r_k,
  z_k = C C^T r_k.
610/ 627
![Page 673: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/673.jpg)
Using Ã = C^T A C, x_k = C x̃_k, p_k = C p̃_k, r̃_k = C^T r_k, we can write the CG algorithm for both the preconditioned variables and the unpreconditioned ones.

Conjugate Gradient algorithm
1. Compute r̃_0 = b̃ − Ã x̃_0 and p̃_0 = r̃_0
2. For k = 0, 1, ... Do
3.   α_k = r̃_k^T r̃_k / p̃_k^T Ã p̃_k = r_k^T C C^T r_k / p_k^T A p_k = r_k^T z_k / p_k^T A p_k
4.   x̃_{k+1} = x̃_k + α_k p̃_k   (multiply by C)  ⇒  x_{k+1} = x_k + α_k p_k
5.   r̃_{k+1} = r̃_k − α_k Ã p̃_k   (multiply by C^{-T})  ⇒  r_{k+1} = r_k − α_k A p_k
6.   β_k = r̃_{k+1}^T r̃_{k+1} / r̃_k^T r̃_k = r_{k+1}^T C C^T r_{k+1} / r_k^T C C^T r_k = r_{k+1}^T z_{k+1} / r_k^T z_k
7.   p̃_{k+1} = r̃_{k+1} + β_k p̃_k   (multiply by C)  ⇒  p_{k+1} = C C^T r_{k+1} + β_k p_k = z_{k+1} + β_k p_k
8.   if x_k accurate enough then stop
9. EndDo
611/ 627
![Page 674: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/674.jpg)
Writing the algorithm only using the unpreconditioned variables leads to:

Preconditioned Conjugate Gradient algorithm
1. Compute r_0 = b − A x_0, z_0 = M r_0 and p_0 = z_0
2. For k = 0, 1, ... Do
3.   α_k = r_k^T z_k / p_k^T A p_k
4.   x_{k+1} = x_k + α_k p_k
5.   r_{k+1} = r_k − α_k A p_k
6.   z_{k+1} = M r_{k+1}
7.   β_k = r_{k+1}^T z_{k+1} / r_k^T z_k
8.   p_{k+1} = z_{k+1} + β_k p_k
9.   if x_k accurate enough then stop
10. EndDo
612/ 627
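The preconditioned algorithm can be sketched in Python, here with a simple Jacobi (diagonal) preconditioner on an invented 3×3 SPD system:

```python
import math

def pcg(matvec, psolve, b, x0, maxiter=100, tol=1e-12):
    """Preconditioned CG following the algorithm above; psolve applies
    M (an approximation of A^{-1}) to a vector: z = M r."""
    dot = lambda u, w: sum(ui * wi for ui, wi in zip(u, w))
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(x))]
    z = psolve(r)
    p = list(z)                                   # p_0 = z_0
    rz = dot(r, z)
    for _ in range(maxiter):
        if math.sqrt(dot(r, r)) <= tol:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = psolve(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

# Jacobi (diagonal) preconditioner on a made-up SPD system:
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
mv = lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A]
jac = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]
x = pcg(mv, jac, [5.0, 6.0, 5.0], [0.0, 0.0, 0.0])
```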
![Page 675: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/675.jpg)
MA has fewer distinct eigenvalues: an example

Let 𝒜 = | A    B^T |   and   P = | A    0            |.
        | C    0   |             | 0    C A^{-1} B^T |

Then P^{-1} 𝒜 has three distinct eigenvalues.
[Murphy, Golub, Wathen, SIAM SISC, 21 (6), 2000]
613/ 627
![Page 676: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/676.jpg)
Preconditioner taxonomy
There are two main classes of preconditioners
I Implicit preconditioners: approximate A with a matrix M such that solving the linear system Mz = r is easy.
I Explicit preconditioners:approximate A−1 with a matrix M and just perform z = Mr .
The governing ideas in the design of the preconditioners are very similar to those followed to define iterative stationary schemes. Consequently, all the stationary methods can be used to define preconditioners.
614/ 627
![Page 677: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/677.jpg)
Stationary methods
Let x_0 be given and M ∈ R^{n×n} a nonsingular matrix; compute

  x_k = x_{k−1} + M(b − A x_{k−1}).

Note that b − A x_{k−1} = A(x∗ − x_{k−1}) ⇒ the best M is A^{-1}.
The stationary scheme converges to x∗ = A^{-1} b for any x_0 iff ρ(I − MA) < 1, where ρ(·) denotes the spectral radius.
Let A = L + D + U:

I M = I : Richardson method,
I M = D^{-1} : Jacobi method,
I M = (L + D)^{-1} : Gauss-Seidel method.

Notice that M always has a special structure and the inverse must never be explicitly computed (z = B^{-1} y reads: solve the linear system B z = y).
615/ 627
![Page 678: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/678.jpg)
Preconditioner location
Several possibilities exist to solve Ax = b:
I Left preconditioner:
  M A x = M b.
I Right preconditioner:
  A M y = b with x = M y.
I Split preconditioner, if M = M_1 M_2:
  M_2 A M_1 y = M_2 b with x = M_1 y.

Notice that the spectra of MA, AM and M_2 A M_1 are identical (for any square matrices B and C, the eigenvalues of BC are the same as those of CB).
616/ 627
![Page 679: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/679.jpg)
Some classical algebraic preconditioners
I Incomplete factorization: IC, ILU(p), ILU(p, τ)
I SPAI (Sparse Approximate Inverse): compute the sparse approximate inverse by minimizing the Frobenius norm ‖MA − I‖_F
I FSAI (Factorized Sparse Approximate Inverse): compute the sparse approximate inverse of the Cholesky factor by minimizing the Frobenius norm ‖I − GL‖_F
I AINV (Approximate Inverse): compute the sparse approximate inverse of the LDU or LDL^T factors using an incomplete biconjugation process
617/ 627
![Page 680: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/680.jpg)
Incomplete factorizations
One variant of the LU factorization writes:

IKJ variant - top looking variant
1. for i = 2, ..., n do
2.   for k = 1, ..., i − 1 do
3.     a_ik = a_ik / a_kk
4.     for j = k + 1, ..., n do
5.       a_ij = a_ij − a_ik * a_kj
6.     end for
7.   end for
8. end for
618/ 627
![Page 681: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/681.jpg)
Zero fill-in ILU - ILU(0)
Let NZ(A) denote the set of (row, column) indices of the nonzero entries of A.

ILU(0)
1. for i = 2, ..., n do
2.   for k = 1, ..., i − 1 and (i, k) ∈ NZ(A) do
3.     a_ik = a_ik / a_kk
4.     for j = k + 1, ..., n and (i, j) ∈ NZ(A) do
5.       a_ij = a_ij − a_ik * a_kj
6.     end for
7.   end for
8. end for
619/ 627
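ILU(0) can be sketched in Python on a dense array, keeping the original nonzero pattern as an explicit set (the tridiagonal example is invented; since a tridiagonal matrix produces no fill-in, ILU(0) coincides with the exact LU factorization there):

```python
def ilu0(A):
    """In-place ILU(0) in the IKJ order shown above: the elimination is
    performed, but an entry is updated only if it lies in the nonzero
    pattern NZ(A) of the original matrix (no fill-in is stored)."""
    n = len(A)
    nz = {(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0}
    for i in range(1, n):
        for k in range(i):
            if (i, k) not in nz:
                continue
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                if (i, j) in nz:
                    A[i][j] -= A[i][k] * A[k][j]
    return A

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
F = ilu0([row[:] for row in A])
# F holds L (strict lower part, unit diagonal implied) and U (upper
# part); here L*U reproduces A exactly because there is no fill-in.
```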
![Page 682: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/682.jpg)
Level of fill in ILU - ILU(p)
Definition
The initial level of fill of an entry a_ij is defined by:

  lev(i, j) = 0 if a_ij ≠ 0 or i = j, ∞ otherwise.

Each time this entry is modified in line 5 of the LU top looking algorithm, its level of fill is updated by

  lev(i, j) = min{ lev(i, j), lev(i, k) + lev(k, j) + 1 }.   (14)
620/ 627
![Page 683: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/683.jpg)
A first example
A = | x x x x |    lev = | 0 0 0 0 |    →    | 0 0 0 0 |
    | x x     |          | 0 0 ∞ ∞ |         | 0 0 1 1 |
    | x   x   |          | 0 ∞ 0 ∞ |         | 0 1 0 1 |
    | x     x |          | 0 ∞ ∞ 0 |         | 0 1 1 0 |

ILU(p)
1. for all nonzero entries a_ij set lev(i, j) = 0
2. for i = 2, ..., n do
3.   for k = 1, ..., i − 1 and lev(i, k) ≤ p do
4.     a_ik = a_ik / a_kk
5.     for j = k + 1, ..., n do
6.       a_ij = a_ij − a_ik * a_kj
7.       lev(i, j) = min{ lev(i, j), lev(i, k) + lev(k, j) + 1 }
8.       if lev(i, j) > p then a_ij = 0
9.     end for
10.  end for
11. end for
621/ 627
![Page 684: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/684.jpg)
Another example
| ×     ×   ×   × |      | 0 ∞ ∞ 0 ∞ 0 ∞ 0 |
|   ×     ×       |      | ∞ 0 ∞ ∞ 0 ∞ ∞ ∞ |
|   × ×       ×   |      | ∞ 0 0 ∞ 1 ∞ 0 ∞ |
| ×     × ×       |  →   | 0 ∞ ∞ 0 0 1 ∞ 1 |
|   × ×   × ×     |      | ∞ 0 0 ∞ 0 0 ∞ 1 |
|     × ×   × ×   |      | ∞ ∞ 0 0 2 0 0 2 |
| ×         × ×   |      | 0 ∞ ∞ 1 2 0 0 1 |
|         ×     × |      | ∞ ∞ ∞ ∞ 0 1 2 0 |

It may require storing many fill-ins that are small in absolute value.
622/ 627
![Page 685: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/685.jpg)
ILU(p, τ) dual threshold strategy

I Fix a drop tolerance τ and a number p of fill-ins to be allowed in each row of the incomplete LU factors. At each step of the elimination process, drop all fill-ins that are smaller than τ times the 2-norm of the current row; among the remaining ones keep only the p largest.
I Trade-off between the amount of fill-in (construction time and application time for the preconditioner) and the decrease of the number of iterations.
623/ 627
![Page 686: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/686.jpg)
Incomplete factorization vs. megaflop performance

The preconditioning step requires the solution of two sparse triangular systems, which can lead to poor performance on vector computers due to the data dependencies. Special treatment can be implemented for structured matrices (FD matrices).

Computer     MFlops for PCG   MFlops for PCG with structure   MFlops for CG
NEC SX-3           60                     607                      1124
Cray C-90          56                     444                       737
RS 6000            19                      18                        21
624/ 627
![Page 687: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/687.jpg)
The SPAI preconditioner
The idea is to compute the sparse approximate inverse as the matrix M which minimizes ‖I − MA‖_F (or ‖I − AM‖_F for right preconditioning) subject to certain sparsity constraints. The choice of the Frobenius norm is motivated by the identity:

  ‖I − AM‖_F^2 = Σ_{j=1..n} ‖e_j − A m_{*,j}‖_2^2   (15)

where e_j is the jth unit vector and m_{*,j} is the jth column of M. Because of the sparsity constraint, only a few rows and columns of A are used to compute each column of M. The least-squares problems can be solved efficiently with dense QR.
625/ 627
![Page 688: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/688.jpg)
The SPAI preconditioner (cont)
I Embarrassingly parallel construction.
I Preconditioner application reduces to a sparse matrix-vector product.
I For some applications the pattern of the inverse can be prescribed a priori (e.g. by considering the pattern of powers of A, ParaSails code by E. Chow).

The main difficulty consists in the determination of the sparsity pattern.
626/ 627
![Page 689: High Performance Matrix Computations/Calcul Matriciel Haute](https://reader031.vdocument.in/reader031/viewer/2022013107/586e15511a28abf0718b75de/html5/thumbnails/689.jpg)
Hybrid approaches
One route to the solution of large sparse linear systems in parallel scientific computing is the use of hybrid methods that combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and easy parallelization for the iterative component, and the numerical robustness of the direct part.

I Block preconditioners (block Jacobi, algebraic Schwarz variants, ...).
I Domain decomposition techniques.
627/ 627