A Comparison of the VIRAM-1 and Embedded VLIW Architectures for Use on SVD


  • A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD

    CS 252, Spring 2000
    Jeff Herman, John Loo, Xiaoyi Tang

  • Motivation
    - SVD applications
      - Smart antennas
      - Image processing
      - Medical imaging
    - VLIW
      - Trend in high performance embedded computing
    - Vector
      - Out of favor
      - Flynn bottleneck is a limiting factor in parallelism
      - Known for linear algebra performance

  • C67 Architecture (mapped)
    - Instruction RAM (cache optional)
    - Data RAM (>4 banks)
    - Decode logic (8-way)
    - A register file
    - B register file
    - Functional units: L1, S1, M1, D1, D2, M2, S2, L2

  • C67 Architecture
    - Split register files
      - 16 registers per register file
      - One cross path per register file
    - Instruction latencies
      - Branches: 6 cycles
      - Load: 5 cycles
      - FP add/multiply: 4 cycles

  • TM 1100 VLIW Processor
    - Core architecture
      - 5-issue VLIW
      - 2 FP adders/multipliers
      - 2 load/store units
      - 128 general purpose 32-bit registers
      - 16 KB data cache, 32 KB instruction cache
    - Instruction latencies
      - 3 cycles for branches, load, FP add/multiply

  • VIRAM-1 Microarchitecture
    - 2-way-issue superscalar MIPS IV core
    - Asynchronous vector unit
      - Communication to scalar core through queue
    - 32 general purpose vector and flag registers
    - 32 scalar and control registers
    - 2 VAFU, 2 FFU, 1 VMFU
    - 4-lane standard configuration

  • VIRAM-1 Microarchitecture

  • Testing Conditions
    - SVD routine from CLAPACK
    - Random test matrices with a rank of 10
    - Matrix dimension ratio of 10
    - Sizes range from 100x10 to 300x30
    - Suboptimal parameters used
      - Trends should still hold
    - Assumed 200 MHz clock rate
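As a rough illustration of these conditions, the sketch below builds random rank-10 test matrices for the five sizes (100x10 through 300x30, rows = 10 × columns). It is plain Python; the helper names, the Gaussian entries, and the fixed seed are assumptions made for illustration, not the authors' generator, and the SVD itself (the CLAPACK routine in the study) is not reproduced here.

```python
import random

def random_low_rank(rows, cols, rank, rng):
    """Return a rows x cols matrix (list of lists) of rank at most `rank`,
    built as the product of a rows x rank and a rank x cols Gaussian matrix."""
    left = [[rng.gauss(0.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    right = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rank)]
    return [[sum(left[i][k] * right[k][j] for k in range(rank))
             for j in range(cols)] for i in range(rows)]

def numerical_rank(matrix, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting (illustrative check only)."""
    a = [row[:] for row in matrix]
    rows, cols = len(a), len(a[0])
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            continue  # column is (numerically) dependent on earlier ones
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
    return rank

# Sizes from the slide: dimension ratio 10, 100x10 up to 300x30.
SIZES = [(10 * n, n) for n in range(10, 31, 5)]

rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
test_matrices = {size: random_low_rank(size[0], size[1], 10, rng) for size in SIZES}
```

Each matrix in `test_matrices` could then be handed to whatever SVD routine is under test; `numerical_rank` is only there to confirm that the generated inputs have the intended rank.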

  • Columns vs. Cycles (chart)

    Cycle counts by number of matrix columns (chart y-axis: cycles in millions):

    Columns  TI 'C67 Ideal  TI 'C6711 Cache  TM1100 Cache  TM1100 Ideal  IRAM (4-lane)  IRAM (16-lane)
    10              560394           908837       1428220       1118000         709470          693096
    15             1423163          4403180       3606983       2810000        1551688         1492161
    20             2966400          9706172       8069312       5705000        2788131         2650600
    25             5416843         18262783      13689721       9990000        4509161         4235554
    30             8898635         32285485      23659675      16628000        7107818         6614237

    Spreadsheet note: iram, 250x20 was the 4 lane version
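To put the chart's cycle counts in wall-clock terms, the sketch below converts them to milliseconds at the 200 MHz clock assumed in the testing conditions and expresses each configuration relative to the 4-lane IRAM. The numbers are the 10-column and 30-column entries as read from the deck's spreadsheet data (the run-together digits were split into six columns, which is a reading of the data rather than something the slides state); the function names are illustrative.

```python
CLOCK_HZ = 200e6  # clock rate assumed in the testing conditions

# Cycle counts for the smallest and largest cases, keyed by column count.
CYCLES = {
    "TI 'C67 Ideal":   {10: 560394,  30: 8898635},
    "TI 'C6711 Cache": {10: 908837,  30: 32285485},
    "TM1100 Cache":    {10: 1428220, 30: 23659675},
    "TM1100 Ideal":    {10: 1118000, 30: 16628000},
    "IRAM (4-lane)":   {10: 709470,  30: 7107818},
    "IRAM (16-lane)":  {10: 693096,  30: 6614237},
}

def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock rate."""
    return cycles / clock_hz * 1e3

def relative_to(reference, columns):
    """Cycle count of every configuration divided by the reference's."""
    ref = CYCLES[reference][columns]
    return {name: counts[columns] / ref for name, counts in CYCLES.items()}

if __name__ == "__main__":
    for columns in (10, 30):
        ratios = relative_to("IRAM (4-lane)", columns)
        for name, counts in CYCLES.items():
            ms = cycles_to_ms(counts[columns])
            print(f"{columns:>2} cols  {name:<16} {ms:8.2f} ms"
                  f"  ({ratios[name]:.2f}x IRAM 4-lane)")
```

A ratio above 1 means the configuration needs more cycles than the 4-lane IRAM for that matrix size; a ratio below 1 means it needs fewer.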

  • Ideal C67 and TM 1100 Performance Gap
    - Same memory bottlenecks in both processors
    - Programming model
      - C67
        - Assembly coded kernels
        - 1700 lines
      - TM 1100
        - Only C level optimizations

  • VIRAM-1 Vector Core Scalability (chart: gain vs. standard MIPS core, by lane count)

    Cycle counts:

    Lanes      100x10   150x15    200x20    250x25    300x30
    baseline  2446081  7485848  17153880  33165015  57263924
    1          921459  2261514   4532494   7915642  13015437
    2          749779  1723245   3234960   5438995   8779807
    4          709470  1551688   2788131   4509161   7107818
    8          697400  1508747   2688376   4311352   6760020
    16         693096  1492161   2650600   4235554   6614237

    Gain vs. standard MIPS core (baseline cycles / VIRAM-1 cycles):

    Lanes  100x10  150x15  200x20  250x25  300x30
    1      2.6546  3.3101  3.7846  4.1898  4.3997
    2      3.2624  4.3440  5.3027  6.0976  6.5222
    4      3.4478  4.8243  6.1525  7.3550  8.0565
    8      3.5074  4.9616  6.3808  7.6925  8.4710
    16     3.5292  5.0168  6.4717  7.8301  8.6577
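The gain figures behind this chart can be recomputed directly from the spreadsheet's cycle counts as baseline cycles divided by VIRAM-1 cycles. The sketch below does so for the smallest and largest matrix sizes; the dictionary layout and function names are illustrative, and the values are copied from the deck's spreadsheet data.

```python
# Baseline (standard MIPS core) cycle counts per matrix size.
BASELINE = {"100x10": 2446081, "300x30": 57263924}

# VIRAM-1 cycle counts per lane count and matrix size.
VIRAM_CYCLES = {
    1:  {"100x10": 921459, "300x30": 13015437},
    2:  {"100x10": 749779, "300x30": 8779807},
    4:  {"100x10": 709470, "300x30": 7107818},
    8:  {"100x10": 697400, "300x30": 6760020},
    16: {"100x10": 693096, "300x30": 6614237},
}

def gain(size, lanes):
    """Speedup of VIRAM-1 with `lanes` lanes over the baseline for one size."""
    return BASELINE[size] / VIRAM_CYCLES[lanes][size]

def lane_scaling(size, lanes_from, lanes_to):
    """Extra speedup obtained by going from `lanes_from` to `lanes_to` lanes."""
    return VIRAM_CYCLES[lanes_from][size] / VIRAM_CYCLES[lanes_to][size]

if __name__ == "__main__":
    for size in BASELINE:
        for lanes in VIRAM_CYCLES:
            print(f"{size}  {lanes:>2} lanes  gain {gain(size, lanes):.4f}")
        print(f"{size}  4 -> 16 lanes: {lane_scaling(size, 4, 16):.3f}x")
```

Quadrupling the lane count from 4 to 16 buys only a few percent on either size, which is the sub-linear behaviour the chart is meant to show.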

  • Utilization of Vector Core (chart: sustained/peak bandwidth, by lane count)

    Efficiency:

    Lanes  100x10  150x15  200x20  250x25  300x30
    1      0.5106  0.6385  0.7139  0.7722  0.8044
    2      0.3138  0.4190  0.5001  0.5619  0.5962
    4      0.1658  0.2326  0.2901  0.3389  0.3683
    8      0.0843  0.1196  0.1505  0.1772  0.1936
    16     0.0424  0.0605  0.0763  0.0902  0.0989

    Spreadsheet rows labeled "scalar" and "vector":

             100x10   150x15    200x20    250x25    300x30
    scalar   546701  1091058   1867021   2890722   4342629
    vector  1882172  5775978  12943075  24448699  41879336
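The efficiency values in the spreadsheet appear to be the "vector" row divided by (4 × lanes × cycles). That formula, and in particular the factor of 4 per lane, is inferred from the numbers and is not stated anywhere in the slides, so the sketch below should be read as a consistency check rather than the authors' definition. Names are illustrative.

```python
PEAK_PER_LANE = 4  # inferred from the data, not stated in the deck

# "vector" row of the spreadsheet, per matrix size.
VECTOR_WORK = {"100x10": 1882172, "300x30": 41879336}

# VIRAM-1 cycle counts per lane count and matrix size.
VIRAM_CYCLES = {
    1:  {"100x10": 921459, "300x30": 13015437},
    2:  {"100x10": 749779, "300x30": 8779807},
    4:  {"100x10": 709470, "300x30": 7107818},
    8:  {"100x10": 697400, "300x30": 6760020},
    16: {"100x10": 693096, "300x30": 6614237},
}

def utilization(size, lanes):
    """Sustained/peak ratio under the inferred peak of 4 per lane per cycle."""
    peak = PEAK_PER_LANE * lanes * VIRAM_CYCLES[lanes][size]
    return VECTOR_WORK[size] / peak

if __name__ == "__main__":
    for size in VECTOR_WORK:
        for lanes in VIRAM_CYCLES:
            print(f"{size}  {lanes:>2} lanes  utilization {utilization(size, lanes):.4f}")
```

Under this reading, utilization falls as lanes are added because total cycles barely shrink while peak capacity grows with the lane count, and it rises with matrix size.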

  • VIRAM Performance Summary
    - Gains from vector unit limited by Amdahl's law.
      - Vector instructions comprise only ~15% of total code.
      - Not much else of SVD can be vectorized.
      - Gains limited by what cannot be vectorized.
      - Perhaps streamline LAPACK or hand-code assembly?
    - Sub-linear scalability.
      - Scaling IRAM is cheap but gains diminish.
      - Efficiency and scalability increase with size of data set.
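The Amdahl's law limit mentioned above can be made concrete with a small sketch. Here `vector_fraction` is the fraction of baseline run time that benefits from the vector unit and `vector_speedup` is how much faster that part runs; the 0.9 used in the example is a hypothetical value chosen for illustration, not a measurement from the slides (the ~15% figure on the slide refers to the share of code, not of run time).

```python
import math

def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when `vector_fraction` of the run time is sped up
    by `vector_speedup` and the rest is unchanged (Amdahl's law)."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)

def speedup_limit(vector_fraction):
    """Upper bound on overall speedup as vector_speedup grows without bound."""
    if vector_fraction >= 1.0:
        return math.inf
    return 1.0 / (1.0 - vector_fraction)

if __name__ == "__main__":
    fraction = 0.9  # hypothetical vectorizable share of run time
    for speedup in (1, 2, 4, 8, 16):
        print(f"vector part {speedup:>2}x faster -> overall "
              f"{amdahl_speedup(fraction, speedup):.2f}x")
    print(f"limit: {speedup_limit(fraction):.1f}x")
```

However fast the vectorized part becomes, the overall gain cannot exceed `speedup_limit`, which is set entirely by the portion that stays on the scalar core.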

  • Concluding Remarks
    - The two architectures have different limitations
      - VIRAM: scalar core
      - VLIW: memory bandwidth
    - VLIW cannot match performance of VIRAM when computing SVD.
    - VLIW with vector coprocessor?