hw5 vector solutions

8/12/2019 Hw5 Vector Solutions

CSC506 Vector Processor Homework due Friday, June 11, 1999

Q1. A low-order interleaved memory has 32 memory modules that are addressed by word and numbered 0, 1, ..., 31. If the processor generates a word address in the Memory Address Register (MAR) of A45C23B, which memory module will be accessed?

With 32 modules, we have the five low-order bits to select the correct module.

The two low-order hex digits are 3B (0011 1011 binary). The five low-order bits are 11011, so we select module 0x1B, or decimal 27.
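The module selection above can be checked with a short sketch. The function name `module_index` and its parameters are illustrative, not part of the assignment, and the bit-mask form assumes the module count is a power of two.

```python
def module_index(word_address: int, num_modules: int = 32) -> int:
    """Return the module selected by low-order interleaving."""
    # Assumption: num_modules is a power of two, so the low-order
    # log2(num_modules) bits of the word address pick the module.
    assert num_modules > 0 and num_modules & (num_modules - 1) == 0
    return word_address & (num_modules - 1)


# Address from Q1, interpreted as hexadecimal.
print(module_index(0xA45C23B))  # 27 (0x1B)
```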

Q2. A company builds a vector computer with eight parallel processors at a clock rate of 400 MHz. If one of the vector instructions generates a scalar by multiplying two vectors and summing the products, what FLOPS rate can the vendor claim?

8 processors x 400 MHz x 2 FLOPS/Hz = 6.4 GFLOPS.
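As a quick check of the arithmetic, here is a minimal sketch. The name `peak_flops` is hypothetical, and the 2 FLOPS per cycle applies only to the chained multiply/add case described in the question.

```python
def peak_flops(processors: int, clock_hz: int, flops_per_cycle: int) -> int:
    """Peak rate = processors x clock rate x floating-point ops per cycle."""
    return processors * clock_hz * flops_per_cycle


# Eight processors at 400 MHz, one multiply plus one add per cycle.
rate = peak_flops(8, 400_000_000, 2)
print(rate / 1e9, "GFLOPS")  # 6.4 GFLOPS
```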

Q3. What is the memory bandwidth (in words/sec) required to support each processor in question 2? (Note that the maximum will occur on VV x V type vector instructions.)

400 MHz x 1 FLOPS/Hz x (2 Fetch + 1 Store)/FLOPS = 1.2 Gwords/sec.

Note: the question explicitly asks for the bandwidth required for each processor, not the eight parallel processors together. You also do not get 2 FLOPS per cycle on VV x V operations; 2 FLOPS per cycle is the special case of multiply/add that yields a scalar.
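The per-processor bandwidth figure can be reproduced with a small sketch. The function and parameter names are illustrative only; the defaults assume one result per cycle with two operand fetches and one store per result, as in the solution.

```python
def bandwidth_words_per_sec(clock_hz: int,
                            results_per_cycle: int = 1,
                            fetches_per_result: int = 2,
                            stores_per_result: int = 1) -> int:
    """Words moved per second for one processor on a vector-vector operation."""
    words_per_result = fetches_per_result + stores_per_result
    return clock_hz * results_per_cycle * words_per_result


# One processor at 400 MHz: 2 fetches + 1 store per result.
print(bandwidth_words_per_sec(400_000_000) / 1e9, "Gwords/sec")  # 1.2 Gwords/sec
```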


Q4. Use the vector computer and vector instruction described in question 2. The floating point add pipeline is 4 stages and the floating point multiply pipeline is 10 stages. What is the effective speedup for vectors of length 8? What is the effective speedup for vectors of length 1000?

The total length of each pipeline is 14 stages, because we are feeding the output of the multiply pipe directly into the add pipe, and we have 8 pipelines operating in parallel (the eight processors).

The best serial time is nk, the number of vector elements times the 14 stages. The parallel time is the time it takes to get all of the vector elements through the eight parallel pipelines. The speedup using a single pipeline, considering the length of the vector, is:

$$S(k) = \frac{nk}{k + (n - 1)}$$

If we were using only a single pipeline to perform the calculations, the speedup with a vector length of 8 and 14 pipeline stages would be:

$$S = \frac{8(14)}{14 + (8 - 1)} = 5.3$$

If we were using a single pipeline, the speedup with a vector length of 1000 and 14 stages would be:

$$S = \frac{1000(14)}{14 + (1000 - 1)} = 13.8$$

But we are splitting the input vectors into eight parallel (sub)vectors and processing them in parallel, so the number of tasks for each pipe is 1/8 the total vector length. Our actual speedup is then:

$$\text{Speedup} = \frac{\text{Best Serial Time}}{\text{Parallel Execution Time}}$$

Using eight parallel pipelines to perform the calculations, the speedup with a vector length of 8 with 14 pipeline stages is:

$$S = \frac{8(14)}{14 + (1 - 1)} = 8$$

Using eight parallel pipelines, the speedup with a vector length of 1000 and 14 stages would be:

$$S = \frac{1000(14)}{14 + (125 - 1)} = 101$$
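The four speedup figures for Q4 can be recomputed with the sketch below. The function name `pipeline_speedup` and the `pipes` parameter are hypothetical. Rounding the per-pipe task count up when the vector length is not a multiple of the pipe count is an assumption that the worked cases (8 and 1000 elements over 8 pipes) do not exercise.

```python
def pipeline_speedup(n: int, k: int, pipes: int = 1) -> float:
    """Speedup of a k-stage pipeline over best serial time n*k.

    With several parallel pipes, each pipe handles ceil(n / pipes) tasks,
    so the parallel time is k + (tasks_per_pipe - 1) clocks.
    """
    tasks_per_pipe = -(-n // pipes)  # integer ceiling division
    serial_time = n * k
    parallel_time = k + (tasks_per_pipe - 1)
    return serial_time / parallel_time


k = 4 + 10  # add pipe chained after multiply pipe
for n in (8, 1000):
    single = pipeline_speedup(n, k)
    eight = pipeline_speedup(n, k, pipes=8)
    print(f"n={n}: single pipe {single:.1f}, eight pipes {eight:.1f}")
```

For the two vector lengths in the question this gives about 5.3 and 13.8 for a single pipe, and 8 and about 101 for eight pipes.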


Q5. A vector load instruction is used to load a vector of length 100 from 8 interleaved memory modules into a vector register of a processor. The memory is organized like the diagram on page 14 of the class notes and each memory unit is built up of SDRAM clocked at 100 MHz needing 6 clocks for RAS access time and 8 clocks cycle time. If the data bus runs at 800 MHz and transfers one word at a time from the memory system to the vector register:

a. how long does it take for a vector with unit stride to load?
b. how long does it take for a vector with a stride of 2 to load?
c. how long does it take for a vector with a stride of 8 to load?

a. With a unit stride, we are reading sequential words out of all 8 SDRAMs in parallel, and transferring them a word at a time over the data bus. The SDRAMs need 6 clocks at 10 ns (RAS access time) to get the first 8 words to the respective MDRs. We can then get 8 words per clock for the remaining 92 words. 92/8 = 11.5 clocks, but we cannot use half a clock. We need to take the full extra clock cycle but get only four words from the last access. So, it will take 6 + 12 = 18 clocks to get the entire 100 words out of memory. At 10 ns per clock, the actual memory read time will be 180 ns. Transfer across the data bus is at 800 MHz, or 1.25 ns per word. We need to allow for the last four words to get to the processor to complete the load, so the total time would be 180 + 4(1.25) = 185 ns.

b. With a stride of 2, we can read out of only every other module, so we get only 4 words per SDRAM clock of 10 ns. We get 4 words from the first access, and need 96/4 = 24 additional clocks. So, it will take 6 + 24 = 30 clocks. Memory read time is 30 x 10 ns/clock = 300 ns. At 1.25 ns per word to transfer over the data bus, the total load time would be 300 + 4(1.25) = 305 ns.

c. With a stride of 8, we are hitting the same SDRAM module for every word. We can get only 1 word per clock of 10 ns. So, it will take 6 + 99 = 105 clocks. Memory read time is 105 x 10 ns/clock = 1050 ns. At 1.25 ns per word to transfer over the data bus, the total load time would be 1050 + 1(1.25) = 1051.25 ns.
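The three load times can be reproduced with the sketch below, which mirrors the timing model used in these solutions: 6 clocks to the first group of words, then one group per 10 ns SDRAM clock, plus bus time for the final group. The function name, its parameters, and the use of gcd to count the modules touched by a given stride are assumptions added for illustration, not part of the assignment.

```python
import math


def vector_load_time_ns(length: int = 100, modules: int = 8, stride: int = 1,
                        ras_clocks: int = 6, sdram_clock_ns: float = 10.0,
                        bus_word_ns: float = 1.25) -> float:
    """Load time under the model used in the Q5 solutions."""
    # Assumption: a stride touches modules / gcd(modules, stride) distinct modules.
    active = modules // math.gcd(modules, stride)
    # The first access delivers `active` words after the RAS delay;
    # each later SDRAM clock delivers up to `active` more.
    extra_clocks = math.ceil((length - active) / active)
    memory_ns = (ras_clocks + extra_clocks) * sdram_clock_ns
    # Words in the last group still have to cross the bus one at a time.
    last_group = length - active * (math.ceil(length / active) - 1)
    return memory_ns + last_group * bus_word_ns


for s in (1, 2, 8):
    print(f"stride {s}: {vector_load_time_ns(stride=s)} ns")
```

With the default parameters this prints 185.0, 305.0, and 1051.25 ns for strides 1, 2, and 8.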
