![Page 1: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/1.jpg)
The Future of Vector Processors
M. Valero, R. Espasa and J. Corbal
UPC, Barcelona
Kyoto, May 28th, 1999
![Page 2: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/2.jpg)
TOP-500 and Vector Processors
[Bar chart: # Systems and % Peak Perf. of vector processors in the TOP-500 list, November 98; vertical axis 0 to 350; data labels 310, 96, 4315, 65]
Systems by vendor:
Fujitsu: 27
NEC: 18
SGI: 15
Hitachi: 5
![Page 3: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/3.jpg)
The Future of Vector ISA’s
• Cross-Pollination of Vector/Superscalar/VLIW
– MMX, Embedded...
• Very-high Performance Architectures
– ILP techniques, IRAM, SDRAM
• Vector Microprocessors
– Numerical Accelerators
– Multimedia Applications
![Page 4: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/4.jpg)
Talk Outline
• The Past:
– Initial Motivation for Vector ISA
– Evolution of Vector Processors
• The Present:
– Recent Announcements
– The Decline of Vector Processors
– Cross-Pollination of Vector/Superscalars/VLIW
• The Future:
– Very-high Performance Architectures
– Vector Microprocessors: Numerical Accelerators, Multimedia Applications
• Conclusions
![Page 5: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/5.jpg)
Characteristics of Numerical Applications
• Examples: Weather prediction, mechanical engineering
• Data structures: Huge matrices (dense, sparse)
• Data types: 64 bits, floating point
• Highly repetitive loops
• Compute-intensive
• Data-Level Parallel
![Page 6: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/6.jpg)
Initial Motivations for Vector Processors
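      real*8 x(9992), y(9992), u(9984)
      subroutine loop
      integer I
      real*8 q
      do I=1,9984
        q = u(I) * y(I)
        y(I) = x(I) + q
        x(I) = q - u(I) * x(I)
      enddo
      end
[Dependence graph, for I = 1 to 9984: x(I), y(I) and u(I) feed two multiplies (u*y, which gives q, and u*x); an add produces the new y(I) and a subtract produces the new x(I)]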
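The reason this loop vectorizes is that no iteration reads a value written by another iteration. The Python sketch below is only an illustration of that point, not code from the talk: `scalar_loop` and `whole_vector` are hypothetical names, and plain lists stand in for the Fortran arrays.

```python
def scalar_loop(x, y, u):
    """Element-by-element version, in the order the slide's loop runs."""
    x, y = list(x), list(y)
    for i in range(len(u)):
        q = u[i] * y[i]
        y[i] = x[i] + q            # uses the old x[i]
        x[i] = q - u[i] * x[i]     # also uses the old x[i]
    return x, y


def whole_vector(x, y, u):
    """Same computation expressed as operations on whole vectors."""
    n = len(u)
    q = [u[i] * y[i] for i in range(n)]
    new_y = [x[i] + q[i] for i in range(n)]
    new_x = [q[i] - u[i] * x[i] for i in range(n)]
    # Elements beyond len(u) are untouched, as in the Fortran arrays.
    return new_x + list(x[n:]), new_y + list(y[n:])
```

Both functions return the same result because each element depends only on the original values at the same index.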
![Page 7: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/7.jpg)
Execution of scalar code
Loop : ld   R1,0(R10)
       ld   R2,0(R11)
       ld   R3,0(R12)
       mulf R4,R1,R2
       mulf R5,R2,R3
       add  R11,R11,#8
       addf R6,R4,R3
       subf R7,R4,R5
       st   0(R12),R7
       add  R10,R10,#8
       st   0(R12),R7
       sub  R13,R13,#1
       bne  Loop
       add  R12,R12,#8
[Pipeline diagram: stages IF, D/L, ALU, M, W for each instruction of the loop]
14 cycles / Iteration
Perfect Memory !!!
![Page 8: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/8.jpg)
Generation of Vector Code
       ld.w   #9984,s2
       ld.w   #0,a2
       ld.w   #8,vs
Loop : mov    s2, vl       ; vl <- min(s2,128)
       ld.l   -y(a2),v0    ; v0 <- y(I:I+127)
       ld.l   -u(a2),v1    ; v1 <- u(I:I+127)
       mul.d  v1,v0,v2     ; q(I:I+127) <- u(I:I+127)*y()
       ld.l   -x(a2),v3    ; v3 <- x(I:I+127)
       add.d  v3,v2,v0     ; v0 <- x(I:I+127) + q(I:I+127)
       st.l   v0,-y(a2)    ; y(I:I+127) <- x(I:I+127) + q( )
       mul.d  v1,v3,v1     ; v1 <- u(I:I+127) * x(I:I+127)
       sub.d  v2,v1,v0     ; v0 <- q( ) - u( ) * x( )
       st.l   v0,-x(a2)    ; x(I:I+127) <- q( ) - u( ) * x( )
       add.w  #1024,a2     ; increment index (128 * 8)
       add.w  #-128,s2     ; 128 iterations less to process
       lt.w   #0,s2
       jbrs.t loop
[Diagram: the elements 0, 1, 2, ... 127 processed by one vector instruction]
A vector iteration is equivalent to 128 scalar iterations
DLP !!!
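The vector code above is a strip-mined loop: each pass sets the vector length to min(remaining, 128) and handles that many elements at once. The Python sketch below only mimics this control structure under simple assumptions (list slices stand in for vector registers, and `strip_mined` is a hypothetical name); it is not a model of the actual machine.

```python
MVL = 128  # maximum vector length, as in the slide's code


def strip_mined(x, y, u, n, mvl=MVL):
    """Process n elements in strips of at most mvl elements."""
    x, y = list(x), list(y)
    start, strips, remaining = 0, 0, n
    while remaining > 0:                     # lt.w #0,s2 / jbrs.t loop
        vl = min(remaining, mvl)             # mov s2, vl
        sl = slice(start, start + vl)
        xs, ys, us = x[sl], y[sl], u[sl]     # vector loads
        q = [a * b for a, b in zip(us, ys)]  # mul.d
        y[sl] = [a + b for a, b in zip(xs, q)]             # add.d + store
        x[sl] = [a - b * c for a, b, c in zip(q, us, xs)]  # mul.d, sub.d + store
        start += vl                          # add.w #1024,a2 (128 * 8 bytes)
        remaining -= mvl                     # add.w #-128,s2
        strips += 1
    return x, y, strips
```

For the 9984 iterations of the example this takes 9984 / 128 = 78 vector iterations.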
![Page 9: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/9.jpg)
Execution of vector code
Loop : mov    s2, vl
       ld.l   -y(a2),v0
       ld.l   -u(a2),v1
       mul.d  v1,v0,v2
       ld.l   -x(a2),v3
       add.d  v3,v2,v0
       st.l   v0,-y(a2)
       mul.d  v1,v3,v1
       sub.d  v2,v1,v0
       st.l   v0,-x(a2)
       add.w  #1024,a2
       add.w  #-128,s2
       lt.w   #0,s2
       jbrs.t loop
5.1 cycles / Iteration
Memory Latency = 24 cycles !!!
14 vector instructions = 1792 scalar instructions
One L/S Port
One Adder, One Multiplier
A vector iteration is equivalent to 128 scalar iterations
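The figures on this slide and on the scalar slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted on the slides; the variable names are illustrative, and the speedup is a rough ratio that assumes both cycle counts refer to one iteration of the source loop.

```python
MVL = 128                 # elements per vector instruction
vector_instructions = 14  # instructions in the vector loop body
scalar_equivalent = vector_instructions * MVL  # scalar instructions replaced

scalar_cycles_per_iter = 14.0  # scalar code, perfect memory
vector_cycles_per_iter = 5.1   # vector code, 24-cycle memory latency

speedup = scalar_cycles_per_iter / vector_cycles_per_iter
cycles_per_vector_iteration = vector_cycles_per_iter * MVL

print(scalar_equivalent, round(speedup, 2), round(cycles_per_vector_iteration, 1))
```

Under these assumptions the vector code is about 2.7 times faster per iteration, even though it is charged a 24-cycle memory latency while the scalar code assumes perfect memory.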
![Page 10: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/10.jpg)
Vector Processor
[Block diagram: main memory holds instructions (scalar + vector) and data; the control unit receives the instructions and handles conditional branches; scalar data go to the scalar registers (Ri := Rj op Rk) and vector data to the vector registers (VR[i] := VR[j] op VR[k])]
![Page 11: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/11.jpg)
Why Vector ISA ?
• Natural way to express Data-Level Parallelism
– Fewer instructions ( 3 )
• Easy way to convey this information to the hardware
• Good hardware implementation
– Affordable/incremental parallelism ( 2 )
– Simple control/faster clock ( 1 )
• Mechanism to deal with memory latency
• Problem: Memory Bandwidth...
![Page 12: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/12.jpg)
Vector versus Scalar Architectures
[Bar chart: number of instructions (in millions), R10k vs. Convex C3; vertical axis 0 to 120]
Vector instruction semantics “encode” many different scalar instructions :
- Loop counters
- Branch computations
- Address generation
F. Quintana, R. Espasa and M. Valero “ A case for merging the ILP..” PDP-98
Rate from 140 to 2
![Page 13: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/13.jpg)
Easy to convey information to the hardware
• Data path:
– No pressure at fetch, decode and issue
– Decentralized control
– Faster cycle times
• Vector memory instructions:
– Spatial locality can be made clearly visible to the hardware through “strides”
– No overhead and good prefetching
– Reduction of memory latency overhead
– Memory uses facts, not guesses
![Page 14: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/14.jpg)
Key parameters for vector processors
• Cycle time
• Scalar processor:
– # of registers and FU’s
– Cache
• Vector processor:
– # of vector registers
– # of FU’s and # of pipes/FU
• Connection to memory:
– # of busses and width
• Number of processors
![Page 15: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/15.jpg)
Cray Y-MP Architecture
[Block diagram: processors P0, P1, ... P7 connect through 4*4 and 8*8 switches to memory modules numbered 0 to 255, with a shared Synchronization block]
tc = 6 ns.
333 Mflops / processor
256 modules. ta = 30 ns.
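The numbers on this slide can be related with a simple model of interleaved memory. The sketch below is an assumption-laden illustration, not a description of the real Y-MP: it assumes word-interleaved modules, one access per cycle from a single stream, 2 flops per cycle per processor, and that a module stays busy for ta / tc cycles. All names are hypothetical.

```python
from math import ceil, gcd

MODULES = 256
CYCLE_NS = 6    # tc
ACCESS_NS = 30  # ta
BUSY_CYCLES = ceil(ACCESS_NS / CYCLE_NS)  # cycles a module is occupied


def distinct_modules(stride, modules=MODULES):
    """Number of different modules a strided stream visits."""
    return modules // gcd(modules, stride)


def conflict_free(stride):
    """True if a module is never revisited while it is still busy."""
    return distinct_modules(stride) >= BUSY_CYCLES


# Peak rate per processor, assuming 2 floating-point results per cycle.
peak_mflops = 2 * 1000 / CYCLE_NS
```

In this model a stride of 1 or 32 words is conflict free, while a stride of 64 keeps returning to only 4 modules and stalls.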
![Page 16: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/16.jpg)
Vector Processors (1 of 2)
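| Year | Machine | Tc (ns) | #FPU’s | Flops/cycle | LD/ST path | Words/cycle | #regs | Elements/register |
|---|---|---|---|---|---|---|---|---|
| 1972 | TI-ASC | 60 | 2 | 4 | LS | 4(32) | - | - |
| 1973 | STAR-100 | 40 | 2 | 2 | L,L,S | 3 | - | - |
| 1975 | Cray-1 | 12.5 | 2 | 2 | LS | 1 | 8 | 64 |
| 1982 | Fujitsu VP 2000 | 7 | 2 | 4 | LS,LS | 4 | 8-256 | 1024-32 |
| 1983 | Cray-XMP | 9.5 | 2 | 2 | L,L,S | 2+1 | 8 | 64 |
| 1983 | Hitachi S810/20 | 19/14 | 6?? | 12?? | L,L,L,LS | 8 or 2 | 32 | 256 |
| 1984 | NEC-SX2 | 6 | 4 | 16 | L,LS | 8 or 4 | 8+8k | 256/64-256 |
| 1985 | Cray-2 | 4.1 | 2 | 2 | LS | 1 | 8 | 64 |
| 1987 | Hitachi S820/80 | 4 | 3 | 12 | L,LS | 8 or 4 | 32 | 512 |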
![Page 17: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/17.jpg)
Vector Processors (2 of 2 )
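| Year | Machine | Tc (ns) | #FPU’s | Flops/cycle | LD/ST path | Words/cycle | #regs | Elements/register |
|---|---|---|---|---|---|---|---|---|
| 1987 | Convex C2 | 40 | 2 | 2 | LS | 1 | 8 | 128 |
| 1988 | Cray Y-MP | 6.3 | 2 | 2 | L,L,S | 2+1 | 8 | 64 |
| 1989 | Fujitsu VP 2600 | 3.2 | 4 | 16 | LS,LS | 8 | 2048-64 | 64-2048 |
| 1990 | NEC SX-3 | 2.9 | 4 | 16 | L,L,S | 8+4 | 8+16k | 256/64-256 |
| 1992 | Cray C90 | 4 | 2 | 4 | L,L,S | 4+2 | 8 | 128 |
| 1993 | Hitachi S-3800 | 2 | 2(?) | 16(?) | L,L,L,LS | 8 or 2 | - | - |
| 1994 | Convex C4 | 7.4 | 2 | 2 | LS | 1 | 8 | 128 |
| 1996 | NEC SX-4 | 8 | 2 | 16 | LS,LS | 16 | 8+16k | 256/64-256 |
| 1998 | NEC SX-5 | 4 | 2 | 32 | LS,LS | 32 | 8+16k | 256/64-256 |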
![Page 18: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/18.jpg)
Evolution of Cray Machines
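| Machine | Year | Clock (MHz) | Mflops/CPU | # CPUs | Memory BW/CPU | Load latency (ns) |
|---|---|---|---|---|---|---|
| Cray-1 | 1976 | 80 | 160 | 1 | 640 MB/s | 150 |
| Cray-XMP | 1982 | 105 | 210 | 2 | 2.5 GB/s | 123 |
| Cray-2 | 1982 | 243 | 486 | 4 or 8 | 1.9 GB/s | 200 |
| Cray-YMP | 1989 | 167 | 334 | 8 | 4 GB/s | 100 |
| Cray-C90 | 1992 | 243 | 970 | 16 | 12 GB/s | 95 |
| Cray-J90 | 1995 | 100 | 200 | 32 | 1.6 GB/s | 340 |
| Cray-T90 | 1994 | 450 | 1800 | 32 | 21 GB/s | 70/116 |
| Cray-SV-1 | 1998 | | | | | |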
Courtesy of SGI/Cray
Tc : x6 ILP : x2 # of proc. x32 Total : x400
![Page 19: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/19.jpg)
Vector Innovations (1 of 2)
• Star-100/Cyber-200 had many of them:
– Gather/scatter
– Masked operations for conditionals
• Cray-1 introduced vector registers
• BSP had instructions for recurrences and multioperand
• Instructions to optimize masked vector operations
• Instructions to handle Index and Bit sequence on mask register
• Flexible addressing of subvector registers (C4)
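Gather/scatter, the first innovation listed, lets a vector instruction reach elements through an index vector, which is what sparse-matrix codes need. The Python sketch below shows the idea only; `gather`, `scatter` and `sparse_axpy` are hypothetical names, lists stand in for memory and vector registers, and the indices are assumed to be distinct.

```python
def gather(memory, indices):
    """Vector load through an index vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Vector store through an index vector."""
    for i, v in zip(indices, values):
        memory[i] = v


def sparse_axpy(y, idx, x, a):
    """y[idx[k]] += a * x[k], written as gather, compute, scatter."""
    v = gather(y, idx)
    v = [vi + a * xi for vi, xi in zip(v, x)]
    scatter(y, idx, v)
```

For example, with `y = [0.0] * 6`, calling `sparse_axpy(y, [4, 1, 3], [1.0, 2.0, 3.0], 2.0)` updates only positions 4, 1 and 3.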
![Page 20: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/20.jpg)
Vector Innovations ( 2 of 2 )
• Multi-pipes (Star/Cyber)
• Vector with Virtual Memory
• Flexible chaining (multi-ported register-file)
• Multilevel register-file (NEC)
• Scalar units sharing vector FU’s (Fujitsu)
• Combined vector and scalar instructions (Titan)
• Short vectors (CS-2 and CM-5)
• Scalar processor: LIW( Fujitsu), SS(NEC)
![Page 21: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/21.jpg)
Automatic vectorization
• Compiler technology for vectorization: over 25 years of development
– Dependence analysis
– Elimination of false dependences
– Strip mining
– Loop interchange
– Partial vectorization
– Idiom recognition
– IF conversion
– Vector parallelization
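IF conversion, one of the techniques listed, turns a conditional inside a loop into a vector compare that produces a mask, followed by operations executed under that mask. The sketch below is a minimal Python illustration with a hypothetical function name; a conditional expression stands in for the hardware mask, so lanes with a false mask bit keep their old value and never perform the division.

```python
def if_converted(a, b):
    """Vector form of:  if a[i] > 0: b[i] = b[i] / a[i]"""
    mask = [ai > 0 for ai in a]  # vector compare sets the mask
    # Masked divide: only lanes whose mask bit is set are computed.
    return [bi / ai if m else bi for ai, bi, m in zip(a, b, mask)]
```

The branch disappears from the loop body, which is what allows the loop to be issued as vector instructions.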
![Page 22: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/22.jpg)
Vector Architectures : Present
• New announcements (NEC, Cray, Fujitsu)
• The decline of vector processors
• Cross-pollination of Vector/Superscalar/VLIW processors
![Page 23: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/23.jpg)
NEC SX-5
• Announced on June 5th. of 1998
• 8 Gflops, CMOS, tc = 4 ns
• Superscalar processor at 500 Mflops
• 32 results/cycle (2 FPU, 16-pipe)
• 32 data memory accesses/cycle (2 ports,16 data/port). Memory bandwidth of 64 GB/s
• System composed of 32 nodes of 128 Gflops each, providing 4 Tflop/s
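The SX-5 figures on this slide are consistent with each other, as the arithmetic below shows. It is a sketch built only on the slide's numbers plus two labeled assumptions: memory words are 8 bytes, and the number of processors per node is inferred as 128 / 8 rather than stated on the slide.

```python
CLOCK_MHZ = 1000 / 4                # tc = 4 ns gives 250 MHz

results_per_cycle = 2 * 16          # 2 FPUs x 16 pipes
peak_gflops = results_per_cycle * CLOCK_MHZ / 1000

words_per_cycle = 2 * 16            # 2 ports x 16 data/port
BYTES_PER_WORD = 8                  # assumption: 64-bit words
bandwidth_gb_s = words_per_cycle * BYTES_PER_WORD * CLOCK_MHZ / 1000

cpus_per_node = 128 / peak_gflops   # inferred, not stated on the slide
system_tflops = 32 * 128 / 1000     # 32 nodes of 128 Gflops

print(peak_gflops, bandwidth_gb_s, cpus_per_node, system_tflops)
```

The result is 8 Gflops and 64 GB/s per processor, that is, one 8-byte word of memory bandwidth per floating-point result.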
![Page 24: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/24.jpg)
Cray SV1
• Announced on June 16th. of 1998
• CMOS, 250 MHz and 4 Gigaflop/proc.
• Vector cache memory
• 2 FU’s of 8 operations/cycle
• “Multi-Streaming” Processor
• Scalable vector architecture (32 nodes of 32 processors…4 Teraflops)
• Future processor enhancements !!!
![Page 25: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/25.jpg)
Fujitsu VP5000
• Announced on April 20 th. of 1999
• 9.2 Gflop/s, CMOS, 0.22 micron, 33 M transistors/chip
• Linpack 1000*1000 gives 8758 Mflop/s
• Crossbar provides 2*1.6 GB/s per processor
• System composed of 512 PE’s, or 4.9 Teraflops
• Maximum of 16 GB/PE or 8 TB/512 PE’s
![Page 26: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/26.jpg)
The decline of vector processors
• Why have vector machines declined so fast in popularity?
– Cost (scalar parallel machines use commodity parts)
– Too restricted in applications (lack of vectorization in many programs)
• Massive use of computers to run so-called “Non-numerical Applications”
![Page 27: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/27.jpg)
Characteristics of non-numerical Applications
• Examples: OLTP,DSS, simulators, games…
• General data structures: Lists, trees, tables…
• Data types: Scalar integers of 8 to 64 bits
• Frequent control flow change…Speculation
• Short distance data dependencies... Forwarding
• Instruction/data locality……Caches
• Fine-grain ILP……..Out-of-order
![Page 28: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/28.jpg)
Micro Killers ???
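| Year | Machine | Clock (MHz) | #op/cycle | Peak Perf. (Mflops) |
|---|---|---|---|---|
| 1976 | Cray-1 | 80 | 2 | 160 |
| 1978 | I-8086 | 10 | - | - |
| 1992 | Cray C-90 | 243 | 4 | 970 |
| 1992 | Alpha 21064 | 150 | 1 | 150 |
| 1994 | Pentium | 100 | 1 | 100 |
| 1996 | NEC SX-4 | 125 | 16 | 2000 |
| 1997 | IBM P2SC | 160 | 4* | 640 |
| 1997 | Alpha 21164 | 500 | 2 | 1000 |
| 1998 | HP PA8200 | 240 | 4* | 960 |
| 1998 | NEC SX-5 | 250 | 32 | 8000 |
| 1998 | Pentium | 400 | 1 | 400 |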
Peak performance = Tc * ILP
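The relation on the last line can be checked against the table: the clock rate in MHz times the operations per cycle should give the listed peak in Mflops. The sketch below does this for the rows that have complete data; the tuple layout and the 1% tolerance are illustrative choices, and the "(1994)"/"(1998)" suffixes are added only to tell the two Pentium rows apart.

```python
# (machine, clock in MHz, operations per cycle, listed peak in Mflops)
machines = [
    ("Cray-1", 80, 2, 160),
    ("Cray C-90", 243, 4, 970),
    ("Alpha 21064", 150, 1, 150),
    ("Pentium (1994)", 100, 1, 100),
    ("NEC SX-4", 125, 16, 2000),
    ("IBM P2SC", 160, 4, 640),
    ("Alpha 21164", 500, 2, 1000),
    ("HP PA8200", 240, 4, 960),
    ("NEC SX-5", 250, 32, 8000),
    ("Pentium (1998)", 400, 1, 400),
]


def peak_mflops(clock_mhz, ops_per_cycle):
    """Peak performance = clock rate x operations per cycle."""
    return clock_mhz * ops_per_cycle


# Rows whose listed peak differs from the formula by more than 1%.
mismatches = [
    name for name, mhz, ops, listed in machines
    if abs(peak_mflops(mhz, ops) - listed) > 0.01 * listed
]
```

Every row agrees with the formula within rounding (the C-90 gives 972 against the listed 970).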
![Page 29: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/29.jpg)
Bandwidth and Performance
Alpha 21264, 500 MHz
IBM Power chip, 160 MHz
HP-8200, 240 MHz
Cray T90, 450 MHz
NEC SX-4, 125 MHz
2 GB/s 2 GB/s 24 GB/S 16 GB/S16 MB
5 Gb/s 768 MB/s
64 KB 128 KB 2 MB8 GB/s 3.84 GB/s 24 GB/s 16 GB/s
576 bytes 704 bytes 8 KB 128 KB
16 GB/s 5.12 GB/s 15.3 GB/s 43.2GB/s 48 GB/s2 FPU1 Gflops
2 (2 pipe)640 Mflops
2 (2 pipe)960Mflops
2 (2 pipe)1.8Gflops
2 (8 pipes)2 Gflops
Main memory
Register file size
Functional Units
L1 cache size
L2 cache size
![Page 30: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/30.jpg)
Peak performance and Bandwidth
[Plot: efficiency (%) versus vector length (0 to 4000) for the IBM RS6000* and the VPP500]
Kernel: Z(I)=C0+A(I)*(C1+B(I)*(C2+C(I)*(C3+D(I)*(C4+E(I)*(C5+F(I)*(C6+G(I)*(C7+H(I)*(C8+K(I)*(C9+L(I))))))))))
* Measurement condition: RS6000-590 (66.6 MHz), FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2
Courtesy of Fujitsu
![Page 31: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/31.jpg)
Vector ideas used in SS’s/VLIW processors
• Address prediction and prefetching
• Exploitation of data locality (the stride value is used for locality detection and exploitation)
• Predicated execution (VLIW)
• Multiply and add, chaining
• Multi-size operands
• Data reuse and vectorization
• Addressing modes (auto-increment)
• Multithreading (2 scalar processors in Fujitsu machines)
• Dynamic load/store elimination
![Page 32: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/32.jpg)
Predictions for ALL instructions
[Bar chart, vertical axis 0 to 100: Last value, Stride, Context 1 and Context 3 predictors]
Y. Sazeides and J.E. Smith, “The predictability of data values”, MICRO-30, 1997
![Page 33: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/33.jpg)
Characterization of Vector Programs
[Bar chart, vertical axis 0 to 100: % vector access, % vectorization and average vector length (Avg. VL)]
R. Espasa “ Advanced Vector Architectures “. PhD Thesis, Feb.97
![Page 34: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/34.jpg)
SS’s ideas usable in vector processors
• Decoupled Vector Architectures
• Multithreaded Vector Architectures
• Out-of-order Vector Architectures
• Simultaneous Multithreaded Vector Architecture
• Victim Register File
R. Espasa, M. Valero and J.E. Smith HPCA96, HPCA97, MICRO97, ICS97...
![Page 35: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/35.jpg)
ILP+DLP: Out-of-order Vector
[Block diagram: Fetch, Decode & Rename, Reorder Buffer, Memory, LD/ST; S registers, A registers, V registers]
R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.
![Page 36: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/36.jpg)
OOO Vector Performance
R. Espasa, M. Valero, J.E. Smith “Out-of-order Vector Architecture” MICRO30, 1997.
![Page 37: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/37.jpg)
Vector Processors : The Future
• Very high-performance architectures
• Vector Microprocessors
– Numerical Accelerators
– Multimedia Applications
![Page 38: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/38.jpg)
Architectures for a Billion Transistors
• Advanced/Superspeculative Architectures
• Trace Processors
• Simultaneous Multithreading
• Multiprocessor on a chip
• RAW processors
• IRAM
Billion -Transistor Architectures. IEEE Computer Sept. 1997
![Page 39: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/39.jpg)
SMV
• Simultaneous Multithreaded Vector Arch.
• Mixes three paradigms
– DLP: vector unit
– ILP: O-o-O execution
– TLP: multithreaded fetch unit
• Requires a memory system with
– high performance at low cost
– low pin-count
R. Espasa and M. Valero, “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO, Sep. 1997
![Page 40: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/40.jpg)
Billion Trans. Vector Architecture
[Block diagram with three parts, Instruction Issue, Execution and Pipeline. Instruction fetch: I cache, 8 program counters (one/thread), thread ID. Instruction decode: 8 rename tables (one/thread). Queues: float point queue (64), integer queue (64), two memory queues (64). Register files: FPRF (128 reg), IRF (128 reg), vector register file (128 reg). Functional units: FPU 1-2, ALU 1-2, two address generators (@ gen), VFU 1-4, connected to memory. Reorder buffer with instruction slots.]
R. Espasa and M. Valero, “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO, Sep. 1997
![Page 41: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/41.jpg)
SMV Performance
R. Espasa and M. Valero, “Exploiting Instruction and Data-Level Parallelism”, IEEE MICRO, Sep. 1997
![Page 42: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/42.jpg)
V-IRAM1
[Block diagram: 2-way superscalar processor with 8K I cache and 8K D cache; vector instruction queue; vector registers; vector units for +, x, ÷ and load/store, each 4 x 64 (or 8 x 32, or 16 x 16); memory crossbar switch connecting to the memory banks (M); I/O and serial I/O]
D.A. Patterson “ New directions in Computer Architecture” Berkeley, June 1998
0.18 µm, 200 MHz, 1.6GFLOPS(64b)/6.4GOPS(16b)/32MB
![Page 43: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/43.jpg)
Conflict-free access to vectors
[Diagram: processors P1, P2, P3, ... Pn connected through interconnection networks to sections of memory modules]
Idea: Out-of-order access
M. Valero et al. ISCA 92, ISCA 95, IEEE-TC 95, ICS 92, ICS 94,...
![Page 44: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/44.jpg)
Command Memory System
[Diagram: processors P1, P2, P3, ... Pn send commands through an interconnection network to section controllers, which drive the memory modules]
Command = <@, Length, Stride, size>
Break commands into bursts at the section controller
J. Corbal, R. Espasa and M. Valero “ Command-Vector Memory System” PACT98
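The command format <@, Length, Stride, size> can be illustrated with a small sketch in which a single command is expanded into element addresses and grouped by memory section. Everything beyond the four fields is an assumption made for illustration: the stride is counted in elements, sections are interleaved element by element, and the names `Command`, `expand` and `bursts_by_section` are hypothetical rather than taken from the cited paper.

```python
from collections import namedtuple

# The four fields named on the slide: base address (@), length, stride, size.
Command = namedtuple("Command", "addr length stride size")


def expand(cmd):
    """Addresses touched by one command (stride counted in elements)."""
    return [cmd.addr + k * cmd.stride * cmd.size for k in range(cmd.length)]


def bursts_by_section(cmd, num_sections=4):
    """Group the addresses of a command by the section that serves them.

    Assumes element-interleaved sections: section = element index mod N.
    """
    bursts = {}
    for a in expand(cmd):
        section = (a // cmd.size) % num_sections
        bursts.setdefault(section, []).append(a)
    return bursts
```

With `Command(0, 8, 1, 8)` each of the four sections receives two addresses, so the processor sends one command instead of eight individual requests.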
![Page 45: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/45.jpg)
System configuration in 2009
[Diagram: 32 SMP (cc-NUMA) nodes, 200 TFLOPS / 160 TB in total. Each node: 32 chips of 200 GF each (6.4 TFLOPS), memory (5 TB) and an X-bar. Bandwidth labels: 100 GB/sec and 800 GB/sec (X-Bar).]
Sustained: Scalar 250 GFLOPS? Vector 1 TFLOPS?
T. Watanabe SC98, Orlando.
![Page 46: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/46.jpg)
Vector Microprocessors
• Ways of reducing the design impact
– Short vectors (64 x 16 words = 8 Kbytes)
– Vector functional units shared with INT/FP units
– Vector register renaming to allow precise exceptions
• Cache hierarchy tuned to vector execution
– Vector data locality allows large data transactions
– Very large bandwidth between cache and vector registers
• High performance for numerical and multimedia applications
![Page 47: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/47.jpg)
General Architecture
[Block diagram: I-Cache, Fetch, Decode; FP and INT units; vector register file (VRF); Vector Cache; Rambus controller with four RDRAM devices; datapath labels 1024 and 8]
![Page 48: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/48.jpg)
Vector PC Vs SuperScalar
[Bar chart, vertical axis 0 to 25, for Hydro2D, Dyfesm, Swm256 and Tomcatv; legend: OoO-SS 1x2, VEC 16 1x2, VEC 16 16x32]
![Page 49: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/49.jpg)
Cache Hierarchy
• Where should the Vector Cache be allocated?
DIRECT RAMBUS
L2
VC CPU
VC
L1 CPU
DIRECT RAMBUS
![Page 50: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/50.jpg)
Performance of the cache hierarchies
[Three plots of FLOPS/CYCLE (EIPC) for BDNA, FLO52 and HYDRO2D, with 2, 8, 16 and 32 on the horizontal axis; series: vector cache on L1, vector cache on L2, perfect cache]
![Page 51: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/51.jpg)
Importance of media Applications
“In the next five years (1998-2002), we believe that media processing will become the dominant force in computer architecture” (K. Diefendorff and P. K. Dubey, IEEE Computer, Sep. 97, pp. 43-45)
“90% of Desktop Cycles will Be Spent on Media Applications by 2000” ( Scott Kirkpatrick of IBM )
![Page 52: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/52.jpg)
Characteristics of media Applications
• Examples: Image/speech processing, communications, virtual reality, graphics…
• Data structures: matrices and vectors
• Data types: Integer(8 -32 bits), FP (32- 64)
• Demand for high memory bandwidth
• Low data locality and latency problem
• No critical data-dependences
• Real time necessity
• Fine/coarse grain parallelism
![Page 53: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/53.jpg)
Multimedia Applications and Architectures
[Diagram: Scientific Applications and Multimedia, mapped onto three kinds of architecture]
• Superscalar + MMX: re-discover the parallelism at run-time using a lot of hardware
• VLIW: simple hardware, but loss of parallelism; as many instructions as the SS approach
• Vector Architectures: natural way to express and execute DLP applications
![Page 54: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/54.jpg)
MMX-like processors
• Multimedia extensions are designed to exploit the parallelism inherent in multimedia applications
• Targeted to leverage full compatibility with existing operating systems and applications, plus minimum chip area investment.
• The highlights of multimedia extensions are:
• Single Instruction, Multiple Data (SIMD) techniques
• New data types (Multimedia Vectors, 32/64 bits)
• Multimedia registers
• SIMD-like instructions, over small integer data types
![Page 55: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/55.jpg)
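MMX instruction example
• PADDW: Parallel ADD of 4x16-bit data type with Wrap Around (No Saturation)
[Diagram: two 64-bit operands (bit positions 0, 15, 31, 47, 63) split into four 16-bit lanes and added lane by lane. A = (A1, A2, A3, xFFFF), B = (B1, B2, B3, x0006), result = (A1+B1, A2+B2, A3+B3, x0005).]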
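The wrap-around behaviour in the example (xFFFF + x0006 = x0005) can be reproduced with a few lines of Python. This is an illustrative emulation on plain integers, assuming lane 0 occupies the low 16 bits of the 64-bit value; the function name simply mirrors the instruction mnemonic.

```python
def paddw(a, b):
    """Add four packed 16-bit lanes of two 64-bit values, wrapping each lane."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        lane_sum = ((a >> shift) & 0xFFFF) + ((b >> shift) & 0xFFFF)
        # Keep only the low 16 bits: the carry out of a lane is discarded
        # instead of propagating into the next lane.
        result |= (lane_sum & 0xFFFF) << shift
    return result
```

A plain 64-bit addition would let the carry from the lowest lane spill into its neighbour; masking each lane is what makes the operation SIMD.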
![Page 56: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/56.jpg)
Superscalar Multimedia Processors
| | PowerPC AltiVec | Intel MMX | Sun VIS | MIPS V / MDMX | HP MAX2 | Alpha MVI |
|---|---|---|---|---|---|---|
| Register file | 32*128 | 8*64 | 32*64 | 32*64 | 32*64 | 32*64 |
| Mapped onto | Separate | FP | FP | FP | Integer | Integer |
| Integer support | 8/16/32 | 8/16/32 | 8/16/32 | 8/16 bit | 16/32 | 8 bit |
| FP support | Yes | MMX2 | No | MIPS V | No | No |
| Usual stuff+ | Lots | Lots | Lots | Lots | Some | None |
| Multiply/MAC | Lots | Mult | Mult | Lots | Some | None |
| Min/Max/Avg | Yes | No | No | Min/Max | Avg | Min/Max |
| Pack/Unpack | Yes | Yes | Yes | Yes | Yes | Yes |
| Byte reordering | All | Some | Some | Many | All | None |
| Unaligned data | 3 Inst. | No | 2 Inst. | Yes | No | No |
| Announced | 2Q98 | 2Q96 | 4Q94 | 4Q96 | 4Q95 | 4Q96 |
Microprocessor Report Vol 12, N 6, May 11, 1998
![Page 57: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/57.jpg)
Multimedia Applications and Architectures
[Diagram: Scientific Applications and Multimedia, mapped onto three kinds of architecture]
• Superscalar + MMX: re-discover the parallelism at run-time using a lot of hardware
• VLIW: simple hardware, but loss of parallelism; as many instructions as the SS approach
• Vector Architectures: natural way to express and execute DLP applications
![Page 58: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/58.jpg)
Multimedia Embedded Systems
• NEC V830R/AV includes MIX2, a multimedia instruction extension (SIMD, MMX-like approach)
• Hitachi SH4 includes FP 4-length vector instructions, targeted at geometry transformation in 3D rendering applications
• ARM10 Thumb Family processors will include a Vector FP unit capable of delivering 600 MFLOPS
![Page 59: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/59.jpg)
Wider is better…(?)
• Most multimedia algorithms exhibit vectors no longer than 8 to 16 elements, so widening the multimedia registers could provide diminishing returns.
[Figure: the same add C = A + B at three register widths: a single 16-bit element (C1 = A1 + B1, bits 0-15); four 16-bit elements in a 64-bit register (A1..A4, bits 0-63); and eight 16-bit elements in a 128-bit register (A1..A8, bits 0-127).]
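The diminishing-returns argument can be made concrete with a little arithmetic. The sketch below assumes 16-bit elements, as in the figures; the helper name `packed_ops` is hypothetical. It counts how many packed instructions a vector of a given length needs at each register width and what fraction of the lanes does useful work.

```python
from math import ceil


def packed_ops(vector_length, register_bits, element_bits=16):
    """Return (instructions needed, lane utilization) for one vector."""
    lanes = register_bits // element_bits
    instructions = ceil(vector_length / lanes)
    utilization = vector_length / (instructions * lanes)
    return instructions, utilization


# An 8-element vector of 16-bit data at increasing register widths.
for bits in (64, 128, 256, 512):
    print(bits, packed_ops(8, bits))
```

For an 8-element vector, going from 64 to 128 bits halves the instruction count, while 256 and 512 bits still need one instruction and leave half or three quarters of the lanes idle.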
![Page 60: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/60.jpg)
VLIW : Widening vs Replication
Bus configurations between Memory and Register File:
• One bus, 1 word wide
• Two buses, 1 word each (replication)
• One bus, 2 words wide (widening)
• Two buses, 2 words each (replication and widening)
D. López et al. "Increasing Memory Bandwidth with Wide Busses", ICS-97
![Page 61: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/61.jpg)
Widening and Replication Performance
[Figure: chart with series Wide 1, Wide 2 and Wide 4; x-axis 2, 4, 8, 16; y-axis 1 to 8.]
D. López et al. "Widening versus replicating...", ICS98, MICRO98
![Page 62: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/62.jpg)
Multimedia Applications and Architectures
[Figure: Scientific Applications and Multimedia mapped onto three architecture families.]
• Superscalar + MMX: re-discover the parallelism at run-time using a lot of hardware
• VLIW: simple hardware, but loss of parallelism; as many instructions as the SS approach
• Vector Architectures: natural way to express and execute DLP applications
![Page 63: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/63.jpg)
Torrent T0 Microprocessor
• The first single-chip vector microprocessor.
• Can sustain over 24 operations per cycle while having an issue rate of only one 32-bit instruction per cycle.
• Features:
– 16 vector registers (32 32-bit elements each)
– 2 vector arithmetic units (8 pipes each)
– Reconfigurable composite operation pipelines
– 128-bit wide external memory interface
– MIPS-II, 32-bit instruction set, scalar unit
K. Asanovic et al. "The T0 vector microprocessor". Hot Chips VII, 1995
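A toy model shows how a single-issue machine can keep many pipes busy. It assumes that each vector instruction operates on a full 32-element register and occupies one 8-pipe unit for 32 / 8 = 4 cycles. It also assumes a third unit, a memory unit with 8 pipes, which the feature list above does not state. The function name and the issue policy (first free unit) are hypothetical simplifications.

```python
from math import ceil


def sustained_ops_per_cycle(units, pipes, vector_length, cycles=1000):
    """Average element operations per cycle with one instruction issued per cycle.

    Element operations are credited when the instruction is issued.
    """
    occupancy = ceil(vector_length / pipes)   # cycles a unit stays busy
    busy_until = [0] * units
    element_ops = 0
    for cycle in range(cycles):
        for u in range(units):
            if busy_until[u] <= cycle:        # first free unit takes the instruction
                busy_until[u] = cycle + occupancy
                element_ops += vector_length
                break                         # at most one issue per cycle
    return element_ops / cycles


print(sustained_ops_per_cycle(units=3, pipes=8, vector_length=32))
```

Under these assumptions three instructions issue every four cycles, which gives 24 element operations per cycle; with the two arithmetic units alone the same model gives 16.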
![Page 64: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/64.jpg)
Torrent T0 Microprocessor
K. Asanovic et al. “ The T0 vector microprocessor “. Hot Chips VII, 1995
![Page 65: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/65.jpg)
Vector versus Superscalar Processors
• Comparison of Die Area
– Processor Die Area (in mm2 scaled to 0.25
[Figure: stacked bar chart of die area split into Control, Registers and Datapath (y-axis 0 to 250) for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO processor with ROB 128. Bar labels: 14.73, 21.86, 37.77, 66.92, 67.77, 69.81, 250.0.]
C. G. Lee and D. J. DeVries "Initial Results on …". MICRO-30, 1997.
![Page 66: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/66.jpg)
Vector versus Superscalar Processors
• Component Percentages
[Figure: stacked bar chart of the percentage (0 to 100) taken by Datapath, Registers and Control for Torrent-0, Alpha 21164, UltraSPARC II, MIPS R10000, HP PA-8000, Alpha 21264 and a 6-way OoO processor with ROB 128.]
C. G. Lee and D. J. DeVries "Initial Results on …". MICRO-30, 1997.
![Page 67: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/67.jpg)
Imagine project
• Focused on developing a programmable architecture that achieves performance similar to special purpose hardware on graphics and image processing.
• Matches the demands of media applications to current VLSI capabilities by using a stream-based programming model.
• Most multimedia kernels exhibit a streaming nature.
• Individual stream elements can be operated on in parallel, thus exploiting data parallelism.
Bill Dally "Tomorrow's Computing Engines", Keynote, HPCA98
![Page 68: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/68.jpg)
Imagine architecture
• Organized around a large stream register file (64Kb)
• Memory operations move entire streams of data
• Data streams pass through a set of arithmetic clusters (8)
• Each cluster operates on a single element under VLIW control
[Figure: Imagine block diagram. Several SDRAM banks connect to the Streaming Memory System, which feeds the Stream Register File; the Stream Register File serves CLUSTER 0 to CLUSTER 7, all driven by a Controller.]
Bill Dally "Tomorrow's Computing Engines", Keynote, HPCA98
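The stream model can be illustrated with a small software sketch. It is not the Imagine programming interface: `run_kernel` and the two example kernels are hypothetical names, and a Python list stands in for a stream held in the stream register file. Only the cluster count (8) comes from the slide.

```python
NUM_CLUSTERS = 8  # number of arithmetic clusters, from the slide


def run_kernel(kernel, stream):
    """Apply one kernel to a whole stream, NUM_CLUSTERS elements per step.

    Returns the output stream and the number of steps taken.
    """
    out = []
    steps = 0
    for base in range(0, len(stream), NUM_CLUSTERS):
        batch = stream[base:base + NUM_CLUSTERS]
        # Each cluster handles one element of the batch, all running the same kernel.
        out.extend(kernel(x) for x in batch)
        steps += 1
    return out, steps


def brighten(pixel):
    return pixel + 40


def clamp(pixel):
    return min(pixel, 255)


pixels = list(range(200, 220))              # a 20-element input stream
stage1, _ = run_kernel(brighten, pixels)    # intermediate stream, never written back to memory
stage2, steps = run_kernel(clamp, stage1)
print(stage2, steps)
```

Chaining kernels through intermediate streams is the point of the large stream register file: only the first load and the last store touch the SDRAM.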
![Page 69: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/69.jpg)
Matrix extensions for Multimedia
• By combining the conventional vector approach with SIMD MMX-like instructions, we can exploit additional levels of DLP with matrix-oriented multimedia extensions.
[Figure: three forms of the add C = A + B: scalar (C1 = A1 + B1); MMX-like, on four 16-bit elements in one 64-bit register (A1..A4, bits 0-63); and matrix, on sixteen elements (A1..A16, B1..B16, C1..C16) arranged as four rows of four 16-bit elements.]
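A matrix-style instruction can be viewed as a packed add repeated over the rows of a vector register. The sketch below assumes four 16-bit lanes per row and wrap-around arithmetic; the slide does not say whether the add wraps or saturates. The names `packed_add16` and `matrix_add16` are hypothetical.

```python
def packed_add16(row_a, row_b):
    """MMX-like add of one row of 16-bit lanes, wrapping modulo 2**16."""
    return [(x + y) & 0xFFFF for x, y in zip(row_a, row_b)]


def matrix_add16(rows_a, rows_b, vector_length):
    """One matrix instruction: a packed add on each of the first vector_length rows."""
    return [packed_add16(ra, rb)
            for ra, rb in zip(rows_a[:vector_length], rows_b[:vector_length])]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[10, 10, 10, 10]] * 4
print(matrix_add16(A, B, vector_length=4))
```

With a vector length of 4 a single instruction covers all sixteen elements, where the MMX-like form would need four packed adds.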
![Page 70: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/70.jpg)
Relative Performance
[Figure: three charts of relative performance for MMX, MDMX and MOM at way 1, way 2, way 4 and way 8: Inverse DCT Transform, MPEG-2 Motion Estimation, and RGB-YCC Color Conversion.]
![Page 71: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/71.jpg)
Applications and Architectures
Numerical Applications
• Integer + Subroutines: Very Slow
• Integer + FPU: Very Big Improvement !!!
• Integer + FPU + VFPU: Additional Speed
![Page 72: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/72.jpg)
Future Applications
• Integer SPEC-like
• Commercial (OLTP, DSS)
• Numerical
• Multimedia
[Figure: chart over the categories Integer, Commercial, Numerical and Multimedia.]
![Page 73: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/73.jpg)
Acknowledgments
• Roger Espasa
• James E. Smith
• Luis A. Villa
• Francisca Quintana
• Jesús Corbal
• David López
• Josep Llosa
• Eduard Ayguade
• Krste Asanovic
• William Dally
• Christoforos E. Kozyrakis
• Corinna G. Lee
• David A. Patterson
• Steve Wallace
![Page 74: The Future of Vector Processors M. Valero, R. Espasa and J. Corbal UPC, Barcelona Kyoto, May 28th, 1999](https://reader037.vdocument.in/reader037/viewer/2022110322/56649d595503460f94a3896f/html5/thumbnails/74.jpg)
The End