The Impact of Computer Architectures on Linear Algebra and Numerical Libraries

Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
High Performance Computers

♦ ~20 years ago: 1 x 10^6 floating point ops/sec (Mflop/s)
  - Scalar based
♦ ~10 years ago: 1 x 10^9 floating point ops/sec (Gflop/s)
  - Vector & shared memory computing, bandwidth aware
  - Block partitioned, latency tolerant
♦ Today: 1 x 10^12 floating point ops/sec (Tflop/s)
  - Highly parallel, distributed processing, message passing, network based
  - Data decomposition, communication/computation overlap
♦ ~10 years away: 1 x 10^15 floating point ops/sec (Pflop/s)
  - Many more levels of memory hierarchy; combination of grids & HPC
  - More adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
TOP500 - Listing of the 500 most powerful computers in the world

- Yardstick: Rmax from LINPACK (MPP benchmark; Ax = b, dense problem; TPP performance as a function of problem size)
- Updated twice a year: at SC'xy in the States in November, and at the meeting in Mannheim, Germany in June
- All data available from www.top500.org
Fastest Computer Over Time

[Chart: GFlop/s (0-50) of the #1 system vs. year, 1990-2000; systems shown include Cray Y-MP (8), Fujitsu VP-2600, and TMC CM-2 (2048).]

In 1980 a computation that took 1 full year to complete can now be done in ~9 hours!
Fastest Computer Over Time

[Chart: GFlop/s (0-500) of the #1 system vs. year, 1990-2000; systems shown include Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), and Hitachi CP-PACS (2040).]

In 1980 a computation that took 1 full year to complete can now be done in ~13 minutes!
Fastest Computer Over Time

[Chart: GFlop/s (0-5000) of the #1 system vs. year, 1990-2000; adds Intel ASCI Red (9152), SGI ASCI Blue Mountain, IBM ASCI Blue-Pacific SST (5808), Intel ASCI Red Xeon (9632), and IBM ASCI White Pacific (7424) to the earlier systems.]

In 1980 a computation that took 1 full year to complete can today be done in ~90 seconds!
Top 10 Machines (Nov 2000)

Rank  Company    Machine                                Procs  Gflop/s  Place                                                     Country  Year
1     IBM        ASCI White                             8192   4938     Lawrence Livermore National Laboratory, Livermore         USA      2000
2     Intel      ASCI Red                               9632   2380     Sandia National Labs, Albuquerque                         USA      1999
3     IBM        ASCI Blue-Pacific SST, IBM SP 604e     5808   2144     Lawrence Livermore National Laboratory, Livermore         USA      1999
4     SGI        ASCI Blue Mountain                     6144   1608     Los Alamos National Laboratory, Los Alamos                USA      1998
5     IBM        SP Power3 375 MHz                      1336   1417     Naval Oceanographic Office (NAVOCEANO)                    USA      2000
6     IBM        SP Power3 375 MHz                      1104   1179     National Centers for Environmental Prediction             USA      2000
7     Hitachi    SR8000-F1/112                          112    1035     Leibniz Rechenzentrum, Muenchen                           Germany  2000
8     IBM        SP Power3 375 MHz, 8-way               1152   929      UCSD/San Diego Supercomputer Center                       USA      2000
9     Hitachi    SR8000-F1/100                          100    917      High Energy Accelerator Research Organization (KEK), Tsukuba  Japan  2000
10    Cray Inc.  T3E1200                                1084   892      Government                                                USA      1998
Performance Development

[Chart: LINPACK performance on a log scale, 100 Mflop/s to 100 Tflop/s, Jun 1993 - Nov 2000, with three curves: N=1 (the #1 system), N=500 (entry level), and SUM (all 500 systems). Nov 2000 values: SUM = 88.1 TF/s, N=1 = 4.9 TF/s (IBM ASCI White), N=500 = 55.1 GF/s. Annotated systems include Intel XP/S140 (Sandia), Fujitsu 'NWT' (NAL), Intel ASCI Red (Sandia), Hitachi/Tsukuba CP-PACS/2048, and IBM ASCI White on the N=1 curve, and Cray Y-MP M94/4 (KFA Jülich), SNI VP200EX (Uni Dresden), Cray Y-MP C94/364 ('EPA' USA), SGI POWER CHALLENGE (Goodyear), Sun Ultra HPC 1000 (News International), Sun HPC 10000 (Merrill Lynch), and IBM 604e 69 proc (A&P) on the N=500 curve.]

Notes: the list range has moved from [60 Gflop/s - 400 Mflop/s] to [4.9 Tflop/s - 55 Gflop/s]; Schwab is at #15; 209 systems exceed 100 Gflop/s; about half of the list turns over each year; growth is faster than Moore's law.
Performance Development

[Chart: N=1, N=500, and Sum curves extrapolated from 1993 to 2009 on a log scale, 0.1 to 1,000,000 Gflop/s, with reference lines at 1 TFlop/s and 1 PFlop/s and annotations for ASCI, the Earth Simulator, and "My Laptop".]

Extrapolation: entry at 1 TFlop/s around 2005, and 1 PFlop/s around 2010.
Architectures

[Chart: number of systems (0-500) by architecture class, Jun 1993 - Nov 2000: Single Processor, SMP, MPP, SIMD, Constellation, and Cluster (NOW). Representative systems: Y-MP C90, Sun HPC, Paragon, CM5, T3D, T3E, SP2, Cluster of Sun HPC, ASCI Red, CM2, VP500, SX3. Nov 2000: 343 MPP, 112 constellations, 28 clusters, 17 SMP.]
Chip Technology

[Chart: number of systems (0-500) by processor family, Jun 1993 - Nov 2000: Alpha, IBM, HP, Intel, MIPS, SUN, and other; the Inmos Transputer appears early in the period.]
High-Performance Computing Directions: Beowulf-class PC Clusters

Definition:
♦ COTS PC nodes: Pentium, Alpha, PowerPC; SMP
♦ COTS LAN/SAN interconnect: Ethernet, Myrinet, Giganet, ATM
♦ Open source Unix: Linux, BSD
♦ Message passing: MPI, PVM, HPF

Advantages:
♦ Best price-performance
♦ Low entry-level cost
♦ Just-in-place configuration
♦ Vendor invulnerable
♦ Scalable
♦ Rapid technology tracking

Enabled by PC hardware, networks and operating systems achieving the capabilities of scientific workstations at a fraction of the cost, and by the availability of industry-standard message-passing libraries. However, much more of a contact sport.
Where Does the Performance Go? or Why Should I Care About the Memory Hierarchy?

[Chart: relative performance, 1980-2000, on a log scale from 1 to 1000. The µProc line follows "Moore's Law" at 60%/yr (2X/1.5 yr); DRAM improves at 9%/yr (2X/10 yrs). The processor-DRAM latency gap grows about 50%/year.]
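The gap figures above compound year over year; a two-line check (the growth rates are the slide's):

```python
# Compound-growth check of the slide's processor-DRAM figures:
# processors improve ~60%/yr, DRAM ~9%/yr, so the gap widens by their ratio.
cpu_rate, dram_rate = 1.60, 1.09
gap_per_year = cpu_rate / dram_rate       # relative divergence per year (~1.47)
decade_gap = gap_per_year ** 10           # accumulated gap after 10 years (~46x)
print(f"gap grows {gap_per_year - 1:.0%}/yr, ~{decade_gap:.0f}x per decade")
```

The ~47%/yr this computes is consistent with the chart's "grows 50%/year" annotation.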
Optimizing Computation and Memory Use

♦ Computational optimizations
  - Theoretical peak: (# FPUs) * (flops/cycle) * MHz
  - PIII: (1 FPU) * (1 flop/cycle) * (850 MHz) = 850 MFLOP/s
  - Athlon: (2 FPUs) * (1 flop/cycle) * (600 MHz) = 1200 MFLOP/s
  - Power3: (2 FPUs) * (2 flops/cycle) * (375 MHz) = 1500 MFLOP/s
♦ Operations like:
  - alpha = x^T y: 2 operands (16 bytes) needed for 2 flops; at 850 Mflop/s this requires 1700 MW/s bandwidth
  - y = alpha*x + y: 3 operands (24 bytes) needed for 2 flops; at 850 Mflop/s this requires 2550 MW/s bandwidth
♦ Memory optimizations
  - Theoretical peak: (bus width) * (bus speed)
  - PIII: (32 bits) * (133 MHz) = 532 MB/s = 66.5 MW/s
  - Athlon: (64 bits) * (133 MHz) = 1064 MB/s = 133 MW/s
  - Power3: (128 bits) * (100 MHz) = 1600 MB/s = 200 MW/s
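The balance argument can be reproduced with a short script; to sidestep word-size conventions it works in MB/s throughout (the 850 MHz PIII figures are the slide's):

```python
# Back-of-envelope balance check for the slide's PIII numbers, in MB/s.

def traffic_mb_per_s(mflops, bytes_per_2flops):
    """MB/s of operand traffic needed to sustain `mflops` when the kernel
    moves `bytes_per_2flops` bytes per two floating-point operations."""
    return mflops * bytes_per_2flops / 2.0

piii_peak = 1 * 1 * 850                    # (# FPUs)*(flops/cycle)*MHz
dot_bw  = traffic_mb_per_s(piii_peak, 16)  # alpha = x^T y: two 8-byte reads / 2 flops
axpy_bw = traffic_mb_per_s(piii_peak, 24)  # y = alpha*x + y: three 8-byte accesses / 2 flops
bus_bw  = (32 // 8) * 133                  # PIII bus peak: 32 bits * 133 MHz, in MB/s

print(dot_bw, axpy_bw, bus_bw)             # 6800.0 10200.0 532
print(f"dot product needs ~{dot_bw / bus_bw:.0f}x the bus's peak bandwidth")
```

Either way the numbers are counted, Level 1 BLAS operations need an order of magnitude more memory bandwidth than the bus supplies, which is why blocked (Level 3) algorithms matter.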
Memory Hierarchy

♦ By taking advantage of the principle of locality:
  - Present the user with as much memory as is available in the cheapest technology.
  - Provide access at the speed offered by the fastest technology.

[Diagram: processor (control, datapath, registers) → on-chip cache → level 2 and 3 cache (SRAM) → main memory (DRAM) → secondary storage (disk) → tertiary storage (disk/tape), plus distributed memory and remote cluster memory. Speeds range from ~1 ns (registers) through 10s of ns (cache), 100s of ns (DRAM), 10s of ms (disk), to 10s of sec (tape); sizes from 100s of bytes (registers) through KB, MB, and GB to TB.]
Self-Adapting Numerical Software (SANS)

♦ Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
♦ Operations like the BLAS require many man-hours per platform
  - Software lags far behind hardware introduction
  - Only done if there is a financial incentive
♦ Hardware, compilers, and software have a large design space with many parameters
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules.
  - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors.
♦ Need for quick/dynamic deployment of optimized routines.
♦ ATLAS - Automatically Tuned Linear Algebra Software
Software Generation Strategy

♦ Level 1 cache multiply optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization
♦ Takes about 30 minutes to run.
♦ "New" model of high-performance programming where critical code is machine generated using parameter optimization.
♦ Code is iteratively generated & timed until the optimal case is found. We try:
  - Differing NBs (blocking factors)
  - Breaking false dependencies
  - M, N and K loop unrolling
♦ Designed for RISC architectures
  - Superscalar
  - Needs a reasonable C compiler
♦ Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
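A toy version of the generate-and-time loop can convey the idea (illustrative only: real ATLAS times generated C kernels and searches far more parameters than the blocking factor NB):

```python
import time
import numpy as np

def blocked_matmul(A, B, nb):
    """Blocked matrix multiply; nb is the kind of parameter (ATLAS's 'NB')
    that an empirical search tunes."""
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, nb):
        for k in range(0, n, nb):
            for j in range(0, n, nb):
                C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
    return C

def pick_best_nb(n=192, candidates=(16, 32, 64)):
    """Greatly simplified generate-and-time loop: run each candidate NB,
    time it, keep the fastest."""
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    times = {}
    for nb in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, nb)
        times[nb] = time.perf_counter() - t0
    return min(times, key=times.get)
```

The winner depends on the machine it runs on, which is exactly the point: the "best" code cannot be chosen once and for all at library-design time.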
ATLAS (DGEMM n = 500)

[Chart: MFLOPS (0-900) for Vendor BLAS, ATLAS BLAS, and F77 BLAS across architectures including AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-933, SGI R10000ip28-200, SGI R12000ip30-270, and Sun UltraSparc2-200.]

♦ ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
Intel PIII 933 MHz: MKL 5.0 vs ATLAS 3.2.0 using Windows 2000

[Chart: MFLOPS (0-800) for Vendor BLAS (MKL), ATLAS BLAS, and Reference BLAS on six Level 3 BLAS routines: DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM, DTRSM.]

♦ ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
Matrix Vector Multiply DGEMV

[Chart: MFLOPS (0-300) for Vendor, ATLAS, and F77 BLAS DGEMV, NoTrans and Trans variants, across architectures including AMD Athlon, DEC ev56/ev6, HP9000, IBM PPC604/Power2/Power3, Pentium Pro/II/III, SGI R10000/R12000, and Sun UltraSparc.]
LU Factorization, Recursive, w/ATLAS

[Chart: MFLOPS (0-700) for Vendor BLAS, ATLAS BLAS, and F77 BLAS across architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-933, SGI R12000ip30-270, SGI R10000ip28-200, Sun UltraSparc2-200.]

♦ ATLAS is faster than all other portable BLAS implementations, and it is comparable with machine-specific libraries provided by the vendor.
Related Tuning Projects

♦ PHiPAC: Portable High Performance ANSI C (www.icsi.berkeley.edu/~bilmes/phipac); the initial automatic GEMM generation project
♦ FFTW: Fastest Fourier Transform in the West (www.fftw.org)
♦ UHFFT: tuning parallel FFT algorithms (rodin.cs.uh.edu/~mirkovic/fft/parfft.htm)
♦ SPIRAL: Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures
♦ Sparsity: sparse-matrix-vector and sparse-matrix-matrix multiplication (www.cs.berkeley.edu/~ejim/publication/); tunes code to the sparsity structure of the matrix; more later in this tutorial
ATLAS Matrix Multiply (64- & 32-bit floating point results)

[Chart: Mflop/s (0-4500) vs. matrix size (100-900) for Intel P4 1.5 GHz with SSE, Intel P4 1.5 GHz, AMD Athlon 1 GHz, and Intel IA64 666 MHz; the 32-bit floating point results use SSE.]
Machine-Assisted Application Development and Adaptation

♦ Communication libraries
  - Optimize for the specifics of one's configuration.
♦ Algorithm layout and implementation
  - Look at the different ways to express the implementation.
Work in Progress: ATLAS-like Approach Applied to Broadcast
(PII 8-way cluster with 100 Mb/s switched network)

[Diagram: broadcast trees rooted at the root node: sequential, binary, ring, and binomial.]

Message size (bytes)   Optimal algorithm   Buffer size (bytes)
8                      binomial            8
16                     binomial            16
32                     binary              32
64                     binomial            64
128                    binomial            128
256                    binomial            256
512                    binomial            512
1K                     sequential          1K
2K                     binary              2K
4K                     binary              2K
8K                     binary              2K
16K                    binary              4K
32K                    binary              4K
64K                    ring                4K
128K                   ring                4K
256K                   ring                4K
512K                   ring                4K
1M                     binary              4K
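For intuition about why binomial wins at small message sizes: it finishes in ceil(log2 p) latency-bound rounds, doubling the number of holders each step. A sketch of that schedule (names are illustrative, not actual MPI/NetSolve code):

```python
def binomial_bcast_schedule(p, root=0):
    """Rounds of a binomial-tree broadcast among p processes (rank `root`
    starts with the data). Returns a list of rounds, each a list of
    (sender, receiver) pairs; every process that already has the data
    forwards it each round, so the broadcast completes in ceil(log2 p)
    latency-dominated steps."""
    rounds = []
    have = {root}
    dist = 1
    while len(have) < p:
        step = [(s, s + dist) for s in sorted(have)
                if s + dist < p and s + dist not in have]
        have.update(r for _, r in step)
        rounds.append(step)
        dist *= 2
    return rounds
```

For large messages the table instead favors pipelined ring/binary algorithms, where bandwidth rather than the number of rounds dominates.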
Reformulating/Rearranging/Reuse

♦ Example: the reduction to narrow band form for the SVD, where the rank-2 update is fused with the next two matrix-vector products:

    A_new = A - u y^T - w v^T
    y = A_new^T u_new
    w = A_new v_new

♦ Fetch each entry of A once
♦ Restructure and combine operations
♦ Results in a speedup of > 30%
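One way to express the fused sweep in NumPy (a sketch of the idea of updating A and forming both matrix-vector products while each piece of A is resident; the block width nb and all variable names are illustrative):

```python
import numpy as np

def fused_band_update(A, u, y, w, v, u_new, v_new, nb=64):
    """Apply A <- A - u y^T - w v^T and form y_new = A^T u_new and
    w_new = A v_new in one sweep over column blocks, so each block of A
    is fetched once instead of three times."""
    m, n = A.shape
    y_new = np.empty(n)
    w_new = np.zeros(m)
    for j in range(0, n, nb):
        cols = slice(j, j + nb)
        # rank-2 update of this column block while it is in cache
        A[:, cols] -= np.outer(u, y[cols]) + np.outer(w, v[cols])
        # reuse the freshly updated block for both matrix-vector products
        y_new[cols] = A[:, cols].T @ u_new
        w_new += A[:, cols] @ v_new[cols]
    return y_new, w_new
```

The arithmetic is unchanged; only the memory traffic drops, which is where the reported >30% speedup comes from.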
CG Variants by Dynamic Selection at Run Time

♦ Variants combine inner products to reduce the communication bottleneck, at the expense of more scalar ops.
♦ Same number of iterations; no advantage on a sequential processor.
♦ With a large number of processors and a high-latency network there may be advantages.
♦ Improvements can range from 15% to 50% depending on size.
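For reference, here is textbook CG with its two per-iteration inner products marked; the variants on the slide rearrange the recurrences so those two reductions can be merged into one (the rearranged variants themselves are not shown here):

```python
import numpy as np

def cg(A, b, tol=1e-10, maxit=1000):
    """Textbook conjugate gradients for SPD A. Each iteration has two
    inner products (p.Ap and r.r) which, on a distributed machine, are
    separate all-reduce synchronization points; communication-reducing
    variants combine them at the cost of a few extra scalar operations."""
    x = np.zeros_like(b)
    r = b.astype(float).copy()
    p = r.copy()
    rr = r @ r                        # inner product / all-reduce #1
    for _ in range(maxit):
        Ap = A @ p
        alpha = rr / (p @ Ap)         # inner product / all-reduce #2
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```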
Gaussian Elimination

[Diagram: elimination patterns for the three organizations.]

♦ Standard way: subtract a multiple of a row.
♦ LINPACK: apply the sequence of eliminations to one column at a time.
♦ LAPACK: apply the sequence to an nb-column panel (a2 = L^-1 a2; a3 = a3 - a1*a2), then apply the panel's transformations to the rest of the matrix.
Gaussian Elimination via a Recursive Algorithm
F. Gustavson and S. Toledo

LU Algorithm:
1: Split the matrix into two rectangles (m x n/2); if only 1 column, scale by the reciprocal of the pivot & return
2: Apply the LU algorithm to the left part
3: Apply the transformations to the right part (triangular solve A12 = L^-1 A12 and matrix multiplication A22 = A22 - A21*A12)
4: Apply the LU algorithm to the right part

Most of the work is in the matrix multiply; matrices of size n/2, n/4, n/8, ...
Recursive Factorizations

♦ Just as accurate as the conventional method
♦ Same number of operations
♦ Automatic variable blocking
  - Level 1 and 3 BLAS only!
♦ Extreme clarity and simplicity of expression
♦ Highly efficient
♦ The recursive formulation is just a rearrangement of the point-wise LINPACK algorithm
♦ The standard error analysis applies (assuming the matrix operations are computed the "conventional" way).
[Charts: (1) DGEMM (ATLAS) & DGETRF (recursive) on an AMD Athlon 1 GHz (~$1100 system), MFlop/s vs. order; (2) LU factorization on a dual-processor Pentium III 550 MHz, Mflop/s (0-800) vs. order (500-5000), comparing LAPACK and recursive LU on one processor and on two.]
[Charts: (1) DGEMM (ATLAS) & DGETRF (recursive) on an AMD Athlon 1 GHz (~$1100 system), MFlop/s (0-2000) vs. order (100-600); (2) SGEMM on a Pentium III 933 MHz under Windows 2000 using SSE, Mflop/s (0-2000) vs. order (100-1000).]
[Charts: (1) DGEMM (ATLAS) & DGETRF (recursive) on an AMD Athlon 1 GHz (~$1100 system), MFlop/s (0-2000) vs. order (100-600); (2) DGEMM on a Pentium III 933 MHz under Windows 2000, Mflop/s (0-800) vs. order.]
[Charts: (1) DGEMM (ATLAS) & DGETRF (recursive) on an AMD Athlon 1 GHz (~$1100 system), MFlop/s (0-2000) vs. order (100-600); (2) S, D, C, and Z GEMM on a Pentium III 933 MHz under Windows 2000, Mflop/s (0-2000) vs. order (100-1000); S & C use SSE.]
SuperLU - High Performance Sparse Solvers

♦ SuperLU: X. Li and J. Demmel
  - Solves sparse linear systems A x = b using Gaussian elimination.
  - Efficient and portable implementations on modern architectures:
    - Sequential SuperLU: PCs and workstations; achieved up to 40% of peak Megaflop rate
    - SuperLU_MT: shared-memory parallel machines; achieved up to 10-fold speedup
    - SuperLU_DIST: distributed-memory parallel machines; achieved up to 100-fold speedup
  - Supports real and complex matrices, fill-reducing orderings, equilibration, numerical pivoting, condition estimation, iterative refinement, and error bounds.
♦ Enabled scientific discovery
  - First solution to the quantum scattering of 3 charged particles [Rescigno, Baertschy, Isaacs & McCurdy, Science, 24 Dec 1999]
  - SuperLU solved complex unsymmetric systems of order up to 1.79 million on the ASCI Blue Pacific computer at LLNL.
Recursive Factorization Applied to Sparse Direct Methods
Victor Eijkhout, Piotr Luszczek & JD

1. Symbolic factorization
2. Search for blocks that contain non-zeros
3. Conversion to sparse recursive storage
4. Search for embedded blocks
5. Numerical factorization

[Figure: layout of the sparse recursive matrix.]
Dense recursive factorization

♦ The algorithm:

function rlu(A)
begin
  rlu(A11);                       recursive call
  A21 ← A21 · U^-1(A11);          xTRSM() on upper triangular submatrix
  A12 ← L^-1(A11) · A12;          xTRSM() on lower triangular submatrix
  A22 ← A22 - A21 · A12;          xGEMM()
  rlu(A22);                       recursive call
end.

♦ Replace xTRSM and xGEMM with sparse implementations that are themselves recursive
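A direct NumPy transcription of rlu() above can make the recursion concrete (a sketch: no pivoting, so it assumes a matrix that doesn't need it, e.g. a diagonally dominant one; real code would call xTRSM/xGEMM):

```python
import numpy as np

def rlu(A):
    """In-place recursive LU without pivoting, following the slide's rlu().
    On return, A holds U in its upper triangle and the multipliers of the
    unit lower triangular L below the diagonal."""
    n = A.shape[0]
    if n == 1:
        return                                   # 1x1 block: nothing to do
    m = n // 2
    rlu(A[:m, :m])                               # recursive call on A11
    L11 = np.tril(A[:m, :m], -1) + np.eye(m)
    U11 = np.triu(A[:m, :m])
    # A21 <- A21 * U11^-1   (the xTRSM on the upper triangular submatrix)
    A[m:, :m] = np.linalg.solve(U11.T, A[m:, :m].T).T
    # A12 <- L11^-1 * A12   (the xTRSM on the lower triangular submatrix)
    A[:m, m:] = np.linalg.solve(L11, A[:m, m:])
    # A22 <- A22 - A21 * A12   (the xGEMM / Schur complement update)
    A[m:, m:] -= A[m:, :m] @ A[:m, m:]
    rlu(A[m:, m:])                               # recursive call on A22
```

Because the recursion halves the trailing matrix each time, almost all the work lands in the single large xGEMM call, which is exactly where ATLAS-tuned kernels pay off.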
Sparse Recursive Factorization Algorithm

♦ Solutions - continued
  - fast sparse xGEMM() is a two-level algorithm
  - recursive operation on sparse data structures
  - dense xGEMM() call when recursion reaches a single block
♦ Uses Reverse Cuthill-McKee ordering, causing fill-in around the band
♦ No partial pivoting
  - use iterative improvement, or
  - pivot only within blocks
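The "iterative improvement" remedy is only a few lines: solve, form the residual, solve again for a correction. A sketch (the perturbed solver in the test below is illustrative, standing in for a factorization computed with restricted pivoting):

```python
import numpy as np

def refine(solve, A, b, iters=3):
    """Iterative improvement: given any approximate solver for Ax = b,
    repeatedly correct the solution with its own residual. Each pass
    reuses the existing factorization, so corrections are cheap."""
    x = solve(b)
    for _ in range(iters):
        r = b - A @ x            # residual (ideally computed in higher precision)
        x = x + solve(r)         # correction step
    return x
```

As long as the approximate factorization is not too far from A, the error contracts by a constant factor per pass, recovering the accuracy that skipping partial pivoting gave up.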
Recursive storage conversion steps

[Figure: the matrix divided into 2x2 blocks; the matrix with explicit 0's and fill-in; recursive algorithm division lines. Legend: • original nonzero value; 0 - zero value introduced due to blocking; x - zero value introduced due to fill-in.]
Breakdown of Time Across Phases for the Recursive Sparse Factorization

[Chart: stacked percentage (0-100%) of time spent in each phase - symbolic factorization, block conversion, recursive conversion, embedded block search, and numerical factorization - across test matrices: af23560, ex11, goodwin, jpwh_991, mcfe, memplus, olafu, orsreg_1, psmigr_1, raefsky_3, raefsky_4, saylr4, sherman3, sherman5, wang3.]
SETI@home

♦ Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence.
♦ Uses data collected with the Arecibo Radio Telescope in Puerto Rico.
♦ When a participant's computer is idle or being wasted, the software downloads a 300 kilobyte chunk of data for analysis.
♦ The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
♦ Largest distributed computation project in existence
  - ~400,000 machines
  - Averaging 26 Tflop/s
♦ Today many companies are trying this for profit.
Distributed and Parallel Systems

[Diagram: a spectrum of systems from massively parallel (homogeneous, tightly coupled, e.g. ASCI-class machines and Beowulf clusters) to distributed (heterogeneous, loosely coupled, e.g. SETI@home and Grid computing).]

Parallel systems:
♦ Bounded set of resources
♦ Apps grow to consume all cycles
♦ Application manages resources
♦ System SW gets in the way
♦ 5% overhead is maximum
♦ Apps drive the purchase of equipment
♦ Real-time constraints
♦ Space-shared

Distributed systems:
♦ Gather (unused) resources
♦ Steal cycles
♦ System SW manages resources
♦ System SW adds value
♦ 10%-20% overhead is OK
♦ Resources drive applications
♦ Time to completion is not critical
♦ Time-shared
The Grid

♦ To treat CPU cycles and software like commodities.
♦ Napster on steroids.
♦ Enable the coordinated use of geographically distributed resources - in the absence of central control and existing trust relationships.
♦ Computing power is produced much like utilities such as power and water are produced for consumers.
♦ Users will have access to "power" on demand.
NetSolve - Network Enabled Server

♦ NetSolve is an example of a grid-based hardware/software server.
♦ Ease of use is paramount.
♦ Based on an RPC model, but with resource discovery, dynamic problem solving capabilities, load balancing, fault tolerance, asynchronicity, security, ...
♦ Other examples are NEOS from Argonne and NINF from Japan.
♦ The idea is to use a resource, not to tie together geographically distributed resources for a single application.
NetSolve: The Big Picture

[Diagram: a client (Matlab, Mathematica, C, Fortran, Java, Excel) sends a request Op(C, A, B) to the NetSolve AGENT(s), which schedules it onto one of the servers S1-S4 (e.g., "S2!"); the data A, B, C may come from a database, and the answer (C) is returned to the client.]

No knowledge of the grid is required; it is RPC-like.
Basic Usage Scenarios

♦ Grid-based numerical library routines
  - User doesn't have to have the software library on their machine: LAPACK, SuperLU, ScaLAPACK, PETSc, AZTEC, ARPACK
♦ Task farming applications
  - "Pleasantly parallel" execution
  - e.g., parameter studies
♦ Remote application execution
  - Complete applications, with the user specifying input parameters and receiving output
♦ "Blue collar" grid-based computing
  - Does not require deep knowledge of network programming
  - Level of expressiveness right for many users
  - User can set things up; no "su" required
  - In use today, with up to 200 servers in 9 countries
Futures for Linear Algebra Numerical Algorithms and Software

♦ Numerical software will be adaptive, exploratory, and intelligent.
♦ Determinism in numerical computing will be gone.
  - After all, it's not reasonable to ask for exactness in numerical computations.
  - Auditability of the computation; reproducibility at a cost
♦ The importance of floating point arithmetic will be undiminished.
  - 16, 32, 64, 128 bits and beyond
♦ Reproducibility, fault tolerance, and auditability
♦ Adaptivity is key so that applications can function appropriately.
Contributors to These Ideas

♦ Top500
  - Erich Strohmaier, LBL
  - Hans Meuer, Mannheim U
♦ Linear Algebra
  - Victor Eijkhout, UTK
  - Piotr Luszczek, UTK
  - Antoine Petitet, UTK
  - Clint Whaley, UTK
♦ NetSolve
  - Dorian Arnold, UTK
  - Susan Blackford, UTK
  - Henri Casanova, UCSD
  - Michelle Miller, UTK
  - Sathish Vadhiyar, UTK

For additional information see:
www.netlib.org/atlas/
www.netlib.org/netsolve/
www.cs.utk.edu/~dongarra/

Many opportunities within the group at Tennessee.
Intel® Math Kernel Library 5.1 for IA32 and Itanium™ Processor-based Linux Applications Beta: License Agreement (PRE-RELEASE)
♦ The Materials are pre-release code, which may not be fully functional and which Intel may substantially modify in producing any final version. Intel can provide no assurance that it will ever produce or make generally available a final version. You agree to maintain as confidential all information relating to your use of the Materials and not to disclose to any third party any benchmarks, performance results, or other information relating to the Materials comprising the pre-release.
♦ See: http://developer.intel.com/software/products/mkl/mkllicense51_lnx.htm