![Page 1: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/1.jpg)
Computer Architecture
Vector Architectures
Ola Flygt
Växjö University
http://w3.msi.vxu.se/users/ofl/
Ola.Flygt@msi.vxu.se
+46 470 70 86 49
![Page 2: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/2.jpg)
Outline
Introduction
Basic principles
Examples
    Cray
![Page 3: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/3.jpg)
Scalar processing
4n clock cycles required to process n elements!
Time op
0 a0
4 a1
8 a2
… …
4n an
![Page 4: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/4.jpg)
Pipelining
About 4+n clock cycles required to process n elements, a speedup of 4n/(4+n) over scalar processing!
Time  op0   op1   op2   op3
0     a0
1     a1    a0
2     a2    a1    a0
3     a3    a2    a1    a0
4     a4    a3    a2    a1
…     …     …     …     …
n     an    an-1  an-2  an-3
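The occupancy table above can be regenerated with a short sketch. It assumes an idealized pipeline with no stalls, one new element entering per cycle, and 0-based element numbering; the function names are made up for this illustration. Counted exactly, n elements need n + p - 1 cycles, which the speedup formula pn/(p+n) used in these slides rounds to p + n.

```python
def pipeline_schedule(n, p=4):
    """Return one row per clock cycle; row[s] is the index of the element
    currently in stage s, or None if that stage is empty."""
    rows = []
    for t in range(n + p - 1):      # the last element leaves stage p-1 at cycle n+p-2
        row = []
        for s in range(p):
            e = t - s               # element that entered the pipeline s cycles ago
            row.append(e if 0 <= e < n else None)
        rows.append(row)
    return rows


def scalar_cycles(n, p=4):
    """Without pipelining every element occupies all p steps by itself."""
    return p * n


def pipelined_cycles(n, p=4):
    """Exact cycle count of the idealized pipeline."""
    return len(pipeline_schedule(n, p))
```

For example, `pipeline_schedule(5)[3]` is the row at time 3 of the table: elements 3, 2, 1 and 0 in stages op0 to op3.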
![Page 5: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/5.jpg)
Pipeline: Basic Principle
Stream of objects
    Number of objects = stream length n
Operation can be subdivided into a sequence of steps
    Number of steps = pipeline length p
Advantage
    Speedup = pn/(p+n)
Stream length >> pipeline length
    Speedup approx. p
Speedup is limited by pipeline length!
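A minimal sketch of the speedup formula, to show how quickly it approaches the pipeline length p as the stream grows. The function name is an illustrative choice, not something from the slides.

```python
def pipeline_speedup(n, p):
    """Speedup pn/(p+n) of a p-stage pipeline on a stream of n objects."""
    return (p * n) / (p + n)


if __name__ == "__main__":
    # With p = 4 the speedup climbs towards 4 but never reaches it.
    for n in (4, 16, 128, 1024):
        print(n, pipeline_speedup(n, 4))
```

When n equals p the speedup is only p/2, so short streams gain little from a deep pipeline.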
![Page 6: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/6.jpg)
Vector Operations
Operations on vectors of data (floating point numbers)
Vector-vector
    V1 <- V2 + V3 (component-wise sum)
    V1 <- -V2
Vector-scalar
    V1 <- c * V2
Vector-memory
    V <- A (vector load)
    A <- V (vector store)
Vector reduction
    c <- min(V)
    c <- sum(V)
    c <- V1 * V2 (dot product)
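The semantics of these operation classes can be written down as plain Python functions over lists. This is only a reference model of what each operation computes; the function names are hypothetical, and real vector hardware performs the same work in pipelined functional units rather than element by element.

```python
def vadd(v2, v3):
    """Vector-vector: component-wise sum V1 <- V2 + V3."""
    return [x + y for x, y in zip(v2, v3)]


def vneg(v2):
    """Vector-vector: V1 <- -V2."""
    return [-x for x in v2]


def vscale(c, v2):
    """Vector-scalar: V1 <- c * V2."""
    return [c * x for x in v2]


def vmin(v):
    """Vector reduction: c <- min(V)."""
    return min(v)


def vsum(v):
    """Vector reduction: c <- sum(V)."""
    return sum(v)


def vdot(v1, v2):
    """Vector reduction: c <- V1 * V2 (dot product)."""
    return sum(x * y for x, y in zip(v1, v2))
```

Vector load and store have no counterpart here because Python lists already live in memory; they matter only on a machine with separate vector registers.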
![Page 7: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/7.jpg)
Vector Operations, cont.
Gather/scatter
    V1,V2 <- GATHER(A)
        load all non-zero elements of A into V1 and their indices into V2
    A <- SCATTER(V1,V2)
        store elements of V1 into A at indices denoted by V2 and fill the rest with zeros
Mask
    V1 <- MASK(V2,V3)
        store elements of V2 into V1 for which the corresponding position in V3 is non-zero
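A reference model of gather, scatter and mask as described above. Two details are assumptions of this sketch and not stated on the slide: indices are 0-based, and MASK leaves the unselected positions of V1 unchanged (a machine could equally well compress the selected elements). The explicit `length` argument of `scatter` is also an addition, since Python needs to know how long A is.

```python
def gather(a):
    """V1,V2 <- GATHER(A): non-zero elements of A and their (0-based) indices."""
    v1 = [x for x in a if x != 0]
    v2 = [i for i, x in enumerate(a) if x != 0]
    return v1, v2


def scatter(v1, v2, length):
    """A <- SCATTER(V1,V2): place V1[k] at index V2[k], zeros elsewhere."""
    a = [0] * length
    for value, index in zip(v1, v2):
        a[index] = value
    return a


def mask(v1, v2, v3):
    """V1 <- MASK(V2,V3): copy V2[i] into V1[i] wherever V3[i] is non-zero.
    Assumption: the other positions of V1 keep their old values."""
    return [new if m != 0 else old for old, new, m in zip(v1, v2, v3)]
```

Gather followed by scatter with the original length reproduces the sparse vector, which is the usual way these two operations are paired.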
![Page 8: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/8.jpg)
Example, Scalar Loop
approx. 6n clock cycles to execute loop.
Fortran loop:
    DO I=1,N
      A(I) = A(I)+B(I)
    ENDDO
Scalar assembly code:
       R0 <- N
       R1 <- 1
       JMP J
    L: R2 <- A(R1)
       R3 <- B(R1)
       R2 <- R2+R3
       A(R1) <- R2
       R1 <- R1+1
    J: JLE R1, R0, L
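The "approx. 6n" figure can be checked by stepping through the scalar code and counting executed instructions. The sketch below assumes one clock cycle per instruction, which is a simplification made here, and uses a hypothetical function name. It mirrors the listing instruction by instruction.

```python
def scalar_loop(a, b):
    """Execute A(I) = A(I) + B(I) like the scalar code does.
    Returns the updated vector and the number of instructions executed."""
    a = list(a)
    count = 0
    r0 = len(a); count += 1           # R0 <- N
    r1 = 1; count += 1                # R1 <- 1 (loop index I starts at 1)
    count += 1                        # JMP J
    while True:
        count += 1                    # J: JLE R1, R0, L
        if not r1 <= r0:
            break
        r2 = a[r1 - 1]; count += 1    # L: R2 <- A(R1)
        r3 = b[r1 - 1]; count += 1    #    R3 <- B(R1)
        r2 = r2 + r3; count += 1      #    R2 <- R2+R3
        a[r1 - 1] = r2; count += 1    #    A(R1) <- R2
        r1 = r1 + 1; count += 1       #    R1 <- R1+1
    return a, count
```

Each iteration costs five body instructions plus one conditional jump, so the total is 6n + 4 under this model; the constant 4 comes from the three setup instructions and the final failing test.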
![Page 9: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/9.jpg)
Example, Vector Loop
4n clock cycles, because no loop iteration overhead (ignoring speedup by pipelining)
Fortran loop:
    DO I=1,N
      A(I) = A(I)+B(I)
    ENDDO
Vectorized assembly code:
    V1 <- A
    V2 <- B
    V3 <- V1+V2
    A <- V3
![Page 10: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/10.jpg)
Chaining
Overlapping of vector instructions (see Hwang, Figure 8.18)
Hence: c+n ticks (for small c)
    Speedup approx. 6 (c=16, n=128, s=(6*128)/(16+128)=5.33)
The longer the vector chain, the better the speedup!
    A <- B*C+D (chaining degree 5)
Vectorization speedups between 5 and 25
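The arithmetic behind the speedup figure is small enough to script. The sketch takes the 6 cycles per element of the scalar loop as the baseline and c + n ticks for the chained vector code; the function and parameter names are illustrative.

```python
def chained_speedup(n, c, scalar_cycles_per_element=6):
    """Speedup of chained vector code (c + n ticks) over a scalar loop
    that needs scalar_cycles_per_element ticks per element."""
    return scalar_cycles_per_element * n / (c + n)


if __name__ == "__main__":
    print(round(chained_speedup(128, 16), 2))
```

With c = 16 and n = 128 this gives 768/144, the 5.33 quoted on the slide. The value approaches 6 only when n is much larger than the startup cost c.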
![Page 11: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/11.jpg)
Vector Programming
How to generate vectorized code?
1. Assembly programming.
2. Vectorized libraries.
3. High-level vector statements.
4. Vectorizing compiler.
![Page 12: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/12.jpg)
Vectorized Libraries
Predefined vector operations (partially implemented in assembly language)
    VECLIB, LINPACK, EISPACK, MINPACK
C = SSUM(100, A(1,2), 1, B(3,1), N)
    100    ... vector length
    A(1,2) ... vector address A
    1      ... vector stride A
    B(3,1) ... vector address B
    N      ... vector stride B
Addition of matrix column to matrix row.
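Why stride 1 selects a column and stride N selects a row follows from Fortran's column-major storage. The sketch below models a matrix as a flat list with leading dimension `ld` and adds two strided vectors element by element. `strided_add` and `offset` are hypothetical helpers written for this illustration; they are not the definition of the SSUM routine named on the slide.

```python
def offset(i, j, ld):
    """Flat position of the 1-based element (i, j) in column-major storage
    with leading dimension ld (number of rows)."""
    return (i - 1) + (j - 1) * ld


def strided_add(length, x, x_start, x_stride, y, y_start, y_stride):
    """Element-wise sum of two vectors given by start position and stride."""
    return [x[x_start + k * x_stride] + y[y_start + k * y_stride]
            for k in range(length)]


# 3x3 example matrices stored column by column: A(i,j) = 10*i + j, B(i,j) = 100*i + j
ld = 3
a = [10 * i + j for j in range(1, 4) for i in range(1, 4)]
b = [100 * i + j for j in range(1, 4) for i in range(1, 4)]

# Column 2 of A (start A(1,2), stride 1) plus row 3 of B (start B(3,1), stride ld)
column_plus_row = strided_add(3, a, offset(1, 2, ld), 1, b, offset(3, 1, ld), ld)
```

Consecutive elements of a column sit next to each other in memory, while consecutive elements of a row are a full column length apart, which is the role N plays in the call on the slide.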
![Page 13: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/13.jpg)
High-Level Vector Statements
e.g. Fortran 90
    INTEGER A(100), B(100), C(100), S
    A(1:100) = S*B(1:100)+C(1:100)
* Vector-vector operations.
* Vector-scalar operations.
* Vector reduction.
* ...
Easy transformation into vector code.
![Page 14: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/14.jpg)
Vectorizing Compiler
1. Fortran 77 DO loop
    DO I=1, N
      D(I) = A(I)*B+C(I)
    ENDDO
2. Vectorization
    D(1:N) = A(1:N)*B+C(1:N)
3. Strip mining
    DO I=1, (N/128)*128, 128
      D(I:I+127) = A(I:I+127)*B + C(I:I+127)
    ENDDO
    IF (MOD(N,128) .NE. 0) THEN
      D((N/128)*128+1:N) = ...
    ENDIF
4. Code generation
    V0 <- V0*B
    ...
Related techniques for parallelizing compiler!
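Strip mining can be sketched in a few lines: the loop is cut into full strips of the vector register length, followed by one shorter remainder strip when N is not a multiple of that length. The sketch assumes a register length of 128 as on the slide and uses 0-based, half-open ranges; the function names are illustrative.

```python
VLEN = 128  # vector register length assumed on the slide


def strips(n, vlen=VLEN):
    """Return the (start, stop) ranges a strip-mined loop processes."""
    full = (n // vlen) * vlen
    ranges = [(start, start + vlen) for start in range(0, full, vlen)]
    if n % vlen != 0:
        ranges.append((full, n))    # remainder strip, shorter than vlen
    return ranges


def strip_mined(a, b, c, vlen=VLEN):
    """Compute D(I) = A(I)*B + C(I) one strip at a time (b is a scalar)."""
    d = [0] * len(a)
    for start, stop in strips(len(a), vlen):
        # Each strip stands for one sequence of vector instructions.
        d[start:stop] = [x * b + y for x, y in zip(a[start:stop], c[start:stop])]
    return d
```

For N = 300 this yields two full strips and one remainder strip of 44 elements, and the result is identical to the plain DO loop.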
![Page 15: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/15.jpg)
Vectorization
In which cases can a loop be vectorized?
    DO I = 1, N-1
      A(I) = A(I+1)*B(I)
    ENDDO
      |
      V
    A(1:128) = A(2:129)*B(1:128)
    A(129:256) = A(130:257)*B(129:256)
    ....
Vectorization preserves semantics.
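A small experiment shows why this loop is safe: every iteration reads an element that has not been overwritten yet, so loading a whole strip before storing it changes nothing. The sketch models a vector statement as "load all operands, then store all results", which is an assumption about the machine, and uses hypothetical function names with 0-based indices.

```python
def scalar_forward(a, b):
    """DO I = 1, N-1: A(I) = A(I+1)*B(I), executed element by element."""
    a = list(a)
    for i in range(len(a) - 1):
        a[i] = a[i + 1] * b[i]
    return a


def vector_forward(a, b, vlen=128):
    """The same loop executed strip by strip with vector semantics."""
    a = list(a)
    n = len(a)
    for start in range(0, n - 1, vlen):
        stop = min(start + vlen, n - 1)
        operands = a[start + 1:stop + 1]     # vector load happens before any store
        a[start:stop] = [x * y for x, y in zip(operands, b[start:stop])]
    return a
```

Both versions agree for any strip length, because the loop only reads elements ahead of the one it writes.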
![Page 16: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/16.jpg)
Loop Vectorization
Is semantics always preserved?
    DO I = 2, N
      A(I) = A(I-1)*B(I)
    ENDDO
      |
      V
    A(2:129) = A(1:128)*B(2:129)
    A(130:257) = A(129:256)*B(130:257)
    ....
Vectorization has changed semantics!
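The change in semantics is easy to reproduce. In the scalar loop each iteration reads the value written by the previous one, so results propagate through the array; a vector statement loads all of its operands first and therefore sees only old values. The sketch models a vector statement as "load all, then store all" and uses hypothetical function names with 0-based indices.

```python
def scalar_recurrence(a, b):
    """DO I = 2, N: A(I) = A(I-1)*B(I), executed element by element."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] * b[i]       # reads the value stored one iteration earlier
    return a


def vector_recurrence(a, b, vlen=128):
    """The same loop executed strip by strip with vector semantics."""
    a = list(a)
    n = len(a)
    for start in range(1, n, vlen):
        stop = min(start + vlen, n)
        operands = a[start - 1:stop - 1]     # old values, loaded before any store
        a[start:stop] = [x * y for x, y in zip(operands, b[start:stop])]
    return a
```

With A = [1, 1, 1, 1] and B = [2, 2, 2, 2] the scalar loop produces [1, 2, 4, 8], while the vector version produces [1, 2, 2, 2].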
![Page 17: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/17.jpg)
Vectorization Inhibitors
Vectorization must be conservative; when in doubt, a loop must not be vectorized.
Vectorization is inhibited by
    Function calls
    Input/output operations
    GOTOs into or out of the loop
    Recurrences (references to vector elements modified in previous iterations)
![Page 18: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/18.jpg)
Components of a vectorizing supercomputer
![Page 19: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/19.jpg)
The DS for floating-point precision
![Page 20: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/20.jpg)
The DS for integer precision
![Page 21: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/21.jpg)
How vectorization works: Un-vectorized computation
![Page 22: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/22.jpg)
How vectorization works: vectorized computation
![Page 23: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/23.jpg)
How vectorization speeds up computation
![Page 24: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/24.jpg)
Speed improvements: Non-pipelined computation
![Page 25: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/25.jpg)
Speed improvements: pipelined computation
![Page 26: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/26.jpg)
Increasing the granularity of a pipeline: Repetition governed by slowest component
![Page 27: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/27.jpg)
Increasing the granularity of a pipeline: Granularity increased to improve repetition
![Page 28: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/28.jpg)
Parallel computation of floating point and integer results
![Page 29: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/29.jpg)
Mixed functional and data parallelism
![Page 30: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/30.jpg)
The DS for parallel computational functionality
![Page 31: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/31.jpg)
Performance of four generations of Cray systems
![Page 32: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/32.jpg)
Communication between CPUs and memory
![Page 33: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/33.jpg)
The increasing complexity in Cray systems
![Page 34: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/34.jpg)
Integration density
![Page 35: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/35.jpg)
Convex C4/XA system
![Page 36: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/36.jpg)
The configuration of the crossbar switch
![Page 37: Computer Architecture Vector Architectures Ola Flygt Växjö University Ola.Flygt@msi.vxu.se +46 470 70 86 49](https://reader035.vdocument.in/reader035/viewer/2022062423/56649c785503460f9492e0ce/html5/thumbnails/37.jpg)
The processor configuration