a comparison of the viram-1 and embedded vliw architectures for use on svd
Post on 07-Jan-2016
-
A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD
CS 252, Spring 2000
Jeff Herman, John Loo, Xiaoyi Tang
-
Motivation
SVD Applications
  Smart antennas
  Image processing
  Medical imaging
VLIW
  Trend in high performance embedded computing
Vector
  Out of favor
  Flynn bottleneck is a limiting factor in parallelism
  Known for linear algebra performance
-
C67 Architecture (mapped)
Instruction Ram (cache optional)
Data Ram (>4 banks)
Decode Logic (8-way)
A Register File
B Register File
Functional units: L1, S1, M1, D1, D2, M2, S2, L2
-
C67 Architecture
Split Register Files
  16 registers per register file
  One cross path per register file
Instruction Latencies
  Branches - 6 cycles
  Load - 5 cycles
  FP add/multiply - 4 cycles
-
TM 1100 VLIW Processor
Core Architecture
  5-issue VLIW
  2 FP adders/multipliers
  2 Load/Store Units
  128 general purpose 32 bit registers
  16KB data cache, 32KB instruction cache
Instruction Latencies
  3 cycles for Branches, Load, FP add/multiply
-
VIRAM-1 Microarchitecture
2-way-issue superscalar MIPS IV core
Asynchronous vector unit
  Communication to scalar core through queue
32 general purpose vector and flag registers
32 scalar and control registers
2 VAFU, 2 FFU, 1 VMFU
4-lane standard configuration
-
VIRAM-1 Microarchitecture
-
Testing Conditions
SVD routine from CLAPACK
Random test matrices with a rank of 10
Matrix dimension ratio of 10
Sizes range from 100x10 to 300x30
Suboptimal parameters used
  Trends should still hold
Assumed 200 MHz clock rate
-
Chart: Columns vs. Cycles
X axis: Columns. Y axis: Cycles in millions.
Series: TI 'C67 Ideal, TI 'C6711 Cache, TM1100 Cache, TM1100 Ideal, IRAM (4-lane), IRAM (16-lane)

Chart data (Sheet1), listed in cycles:

Columns  TI 'C67 Ideal  TI 'C6711 Cache  TM1100 Cache  TM1100 Ideal  IRAM (4-lane)  IRAM (16-lane)
10              560394           908837       1428220       1118000         709470          693096
15             1423163          4403180       3606983       2810000        1551688         1492161
20             2966400          9706172       8069312       5705000        2788131         2650600
25             5416843         18262783      13689721       9990000        4509161         4235554
30             8898635         32285485      23659675      16628000        7107818         6614237

Spreadsheet note: "iram, 250x20 was the 4 lane version"
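
The chart data are raw cycle counts, while the testing conditions assume a 200 MHz clock. As a rough illustration, the sketch below converts a few of the 30-column (300x30) cycle counts from the Sheet1 data into milliseconds. The name `cycles_to_ms` is illustrative and does not come from the slides.

```python
# Convert cycle counts to wall-clock time at the clock rate assumed in the slides.
CLOCK_HZ = 200e6  # 200 MHz, from the testing conditions


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Return the run time in milliseconds for a given cycle count."""
    return cycles / clock_hz * 1000.0


# Cycle counts for the 300x30 case, as read from the spreadsheet data.
cycles_300x30 = {
    "TI 'C67 Ideal": 8898635,
    "TM1100 Ideal": 16628000,
    "IRAM (4-lane)": 7107818,
    "IRAM (16-lane)": 6614237,
}

for name, cycles in cycles_300x30.items():
    print(f"{name}: {cycles_to_ms(cycles):.2f} ms")
```

Changing `clock_hz` shows how the comparison would shift if the processors did not all run at the same assumed clock rate.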
-
Ideal C67 and TM 1100 Performance Gap
Same memory bottlenecks in both processors
Programming model
  C67
    Assembly coded kernels
    1700 lines
  TM 1100
    Only C level optimizations
-
Chart: VIRAM-1 Vector Core Scalability
X axis: Lane Count. Y axis: Gain vs. standard MIPS Core.
Series: 100X10, 150X15, 200X20, 250X25, 300X30

Chart data (Sheet1), cycle counts:

Lanes       100X10    150X15    200X20    250X25    300X30
baseline   2446081   7485848  17153880  33165015  57263924
1           921459   2261514   4532494   7915642  13015437
2           749779   1723245   3234960   5438995   8779807
4           709470   1551688   2788131   4509161   7107818
8           697400   1508747   2688376   4311352   6760020
16          693096   1492161   2650600   4235554   6614237
scalar      546701   1091058   1867021   2890722   4342629
vector     1882172   5775978  12943075  24448699  41879336

Gain vs. baseline:

Lanes        100X10        150X15        200X20        250X25        300X30
1      2.6545738877  3.3101046467  3.7846448335   4.189807346  4.3996927648
2      3.2624026546  4.3440416192  5.3026559834  6.0976366038  6.5222303862
4      3.4477581857  4.8243255087  6.1524655764  7.3550301264  8.0564702135
8      3.5074290221  4.9616324009   6.380759239  7.6924860229  8.4709696125
16     3.5292095179  5.0167830415  6.4716969743  7.8301480751  8.6576764637
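
The gain values in the spreadsheet are consistent with dividing the baseline (standard MIPS core) cycle count by the VIRAM-1 cycle count for the same matrix size. A minimal sketch of that calculation for the 4-lane configuration, using the cycle counts as read from the Sheet1 data; the variable and function names are illustrative:

```python
# Baseline (standard MIPS core) and 4-lane VIRAM-1 cycle counts,
# as read from the Sheet1 data.
sizes = ["100X10", "150X15", "200X20", "250X25", "300X30"]
baseline = [2446081, 7485848, 17153880, 33165015, 57263924]
viram_4_lane = [709470, 1551688, 2788131, 4509161, 7107818]


def gain(baseline_cycles, vector_cycles):
    """Speedup of the vector configuration over the baseline."""
    return baseline_cycles / vector_cycles


gains = [gain(b, v) for b, v in zip(baseline, viram_4_lane)]
for size, g in zip(sizes, gains):
    print(f"{size}: {g:.2f}x")
```

The gain grows with matrix size, which matches the summary point that scalability increases with the size of the data set.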
-
Chart: Utilization of Vector Core
X axis: Lane Count. Y axis: sustained/peak bandwidth.
Series: 100X10, 150X15, 200X20, 250X25, 300X30

Chart data (Sheet1), efficiency:

Lanes        100X10        150X15        200X20        250X25        300X30
1      0.5106499584  0.6385078757  0.7139046957  0.7721641214  0.8044166323
2      0.3137877961  0.4189753924  0.5001250015  0.5618845715  0.5962451111
4       0.165807927  0.2326489765  0.2901377975  0.3388753889  0.3682506361
8      0.0843387941   0.119635242  0.1504518318  0.1772116598  0.1935984287
16     0.0424312613  0.0604825191  0.0762980257   0.090191489  0.0989327454
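
The slides do not say how the efficiency figures were computed. The spreadsheet values appear consistent with dividing the "vector" row by 4 x lanes x cycles. Both the reading of the "vector" row as a count of vector work and the factor of 4 (possibly a per-lane peak rate) are assumptions inferred from the numbers, not statements from the slides. A minimal sketch under those assumptions, for the 100X10 case:

```python
# ASSUMPTION: efficiency = vector / (PEAK_PER_LANE * lanes * cycles).
# The factor 4 is inferred from the spreadsheet numbers, not stated in the slides.
PEAK_PER_LANE = 4

vector_100x10 = 1882172  # "vector" row, 100X10 column
cycles_100x10 = {1: 921459, 2: 749779, 4: 709470, 8: 697400, 16: 693096}


def efficiency(vector, cycles, lanes, peak_per_lane=PEAK_PER_LANE):
    """Sustained work divided by the assumed peak for this lane count."""
    return vector / (peak_per_lane * lanes * cycles)


for lanes, cycles in cycles_100x10.items():
    print(f"{lanes:2d} lanes: {efficiency(vector_100x10, cycles, lanes):.4f}")
```

Because the cycle count barely drops as lanes are added, the assumed peak doubles with each step while the sustained work stays fixed, so utilization roughly halves.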
-
VIRAM Performance Summary
Gains from vector unit limited by Amdahl's law.
  Vector instructions comprise only ~15% of total code.
  Not much else of SVD can be vectorized.
  Gains limited by what cannot be vectorized.
  Perhaps streamline LAPACK or handcode assembly?
Sub-linear scalability.
  Scaling IRAM is cheap but gains diminish.
  Efficiency and scalability increase with size of data set.
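
Amdahl's law makes the diminishing returns concrete: if a fraction f of the run time scales with the number of lanes n, the speedup is 1 / ((1 - f) + f / n). The sketch below also inverts the formula to estimate f from the 1-lane and 16-lane cycle counts for the 300X30 case, as read from the spreadsheet data. Treating lane count as the only thing that changes between the two runs is an assumption made for illustration, and the function names are not from the slides.

```python
def amdahl_speedup(f, n):
    """Speedup when a fraction f of the run time is sped up n times."""
    return 1.0 / ((1.0 - f) + f / n)


def implied_fraction(speedup, n):
    """Invert Amdahl's law: the fraction f that gives this speedup at n."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# 300X30 cycle counts for 1 and 16 lanes, as read from the spreadsheet data.
cycles_1_lane = 13015437
cycles_16_lane = 6614237

measured = cycles_1_lane / cycles_16_lane
f = implied_fraction(measured, 16)
print(f"measured 1->16 lane speedup: {measured:.2f}")
print(f"implied lane-scalable fraction: {f:.2f}")
print(f"upper bound with unlimited lanes: {1.0 / (1.0 - f):.2f}")
```

The implied fraction is a fraction of 1-lane run time, so it is not directly comparable to the ~15% figure above, which counts instructions in the code.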
-
Concluding Remarks
The two architectures have different limitations
  VIRAM: Scalar core
  VLIW: Memory bandwidth
VLIW cannot match performance of VIRAM when computing SVD.
VLIW with vector coprocessor?