Allen Michalski
CSE Department – Reconfigurable Computing Lab, University of South Carolina

Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer


Page 1

Allen Michalski
CSE Department – Reconfigurable Computing Lab
University of South Carolina

Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer

Page 2

MAPLD 2005/253 – Michalski

Outline

• Reconfigurable Computing – Introduction
• SRC-6e architecture, programming model
• Sorting Algorithms
• Design guidelines
• Testing Procedures, Results
• Conclusions, Future Work
• Lessons learned

Page 3

What is a Reconfigurable Computer?

Combination of:
• Microprocessor workstation for front-end processing
• FPGA backend for specialized coprocessing
• Typical PC bus for communications

Page 4

What is a Reconfigurable Computer?

PC Characteristics
• High clock speed
• Superscalar, pipelined
• Out-of-order issue
• Speculative execution
• High-level language programming

FPGA Characteristics
• Low clock speed
• Large number of configurable elements: LUTs, Block RAMs, CPAs, multipliers
• HDL programming

Page 5

What is the SRC-6e?

SRC = Seymour R. Cray. A reconfigurable computer with a high-throughput memory interface:
• 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
• PCI-X (1.0) = 1.064 GB/s

Page 6

SRC-6e Development

• Programming does not require knowledge of hardware design
• C code can compile to hardware

Page 7

SRC Design Objectives

FPGA Considerations
• Superscalar design: parallel, pipelined execution

SRC Considerations
• High overall data throughput: streaming versus non-streaming data transfer?
• Reduction of FPGA data-processing stalls due to data dependencies and data read/write delays: FPGA Block RAM versus SRC OnBoard Memory?
• Evaluate software/hardware partitioning: algorithm partitioning, data-size partitioning

Page 8

Sorting Algorithms

Traditional Algorithms
• Comparison sorts: Θ(n lg n) best case
  – Insertion sort, merge sort, heapsort, quicksort
• Counting sorts
  – Radix sort: Θ(d(n+k))

• HPCS FORTRAN code baseline: radix sort in combination with heapsort
• This research focuses on 128-bit operands, which simplified SRC data transfer and management

Page 9

Sorting – SRC FPGA Implementation

Memory Constraints
• SRC onboard memory: 6 banks × 4 MB; pipelined read or write access; 5-clock latency
• FPGA BRAM memory: 144 blocks, 18 Kbit each; 1-clock read and write latency

Initial Choices
• Parallel insertion sort (BubbleSort): produces sorted blocks; uses onboard-memory pipelined processing to minimize data-access stalls
• Parallel heapsort: random-access merge of sorted lists; uses BRAM for low-latency access, good for random data access

Page 10

Parallel Insertion Sort (BubbleSort)

• Systolic array of cells
• Pipelined SRC processing from OnBoard Memory
• Each cell keeps the highest value and passes the other values on
• Latency is 2x the number of cells

Page 11

Parallel Insertion Sort (BubbleSort)

• Systolic array of cells
• Results passed out in reverse order of comparison (N = number of comparator cells)
• Sorts a list of size L completely in Θ(L²)
• Limit sort size to some number a < L (the list size):
  – Create multiple sorted lists
  – Each list sorted in Θ(a)

Page 12

Parallel Insertion Sort (BubbleSort)

#include <libmap.h>

void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno)
{
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

    /* DMA the input from common memory into OnBoard Memory */
    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
    wait_DMA(0);
    ....
    while (arrayindex < arraysize) {
        endarrayindex = arrayindex + sortsize - 1;
        if (endarrayindex > arraysize - 1)
            endarrayindex = arraysize - 1;

        while (arrayindex < endarrayindex) {
            for (i = arrayindex; i <= endarrayindex; i++) {
                data_high_in = a[i];
                data_low_in  = b[i];

                /* one pass through the systolic comparator array */
                parsort(i == endarrayindex, data_high_in, data_low_in,
                        &data_high_out, &data_low_out);

                c[i] = data_high_out;
                d[i] = data_low_out;
    ....

Page 13

Parallel Heapsort

• Tree structure of cells
• Asynchronous operation with acknowledged data transfer
• Merges sorted lists in Θ(n lg n)
• Designed for independent BRAM block accesses

Page 14

Parallel Heapsort

BRAM Limitations
• 144 Block RAMs at 512 × 32-bit values each: not a whole lot of 128-bit values

OnBoard Memory
• SRC constraint: up to 64 reads and 8 writes in one MAP C file
• Cascading clock delays as the number of reads increases
• Explore the use of muxed access: search and update only 6 of the 48 leaf nodes at a time, in round-robin fashion

Page 15

FPGA Initial Results

Baseline: one V26000; PAR options: -ol high -t 1

Bubblesort Results – 100 Cells
• 29,354 slices (86%); 37,131 LUTs (54%)
• 13.608 ns = 73 MHz (verified operational at 100 MHz)

Heapsort Results – 95 Cells (48 Leafs)
• 21,011 slices (62%); 24,467 LUTs (36%)
• 11.770 ns = 85 MHz (verified operational at 100 MHz)

Page 16

Testing Procedures

• All tests utilize one chip for baseline results
• Evaluate the fastest software radix of operation
• Hardware/software partitioning: five cases; case 5 utilizes FPGA reconfiguration
• Data-size partitioning: 100, 500, 1000, 5000, 10000
• 10 runs for each test case/data partitioning combination
• List size: 500,000 values

Page 17

Results

Fastest Software Operations (Baseline)
• Comparison of radixsort and heapsort combinations: radix 4, 8, and 16 evaluated
• Minimum time: radix-8 radixsort + heapsort (size = 5000 or 10000)
• Radix-16 has too many buckets for the sort-size partitions evaluated
• Heapsort comparisons are faster than radixsort index updates

[Chart: Software Datasize Partitioning – Radixsort vs. Radixsort + Heapsort. X-axis: test case/radix (4, 8, 16) for Radixsort alone and Radix + Heap at list sizes 100, 500, 1000, 5000, 10000; y-axis: time (sec.); series: HeapSort, RadixSort.]

Page 18

Results

• Fastest SW-only time = 3.41 sec.
• Fastest time including HW = 3.89 sec.: Bubblesort (HW), Heapsort (SW), partition list size of 1000
• Heapsort times are dominated by data access and significantly slower than software

[Chart: SRC Software/Hardware Executions (500K data). X-axis: data partition/test case (S-S, H-S, S-H, H-H at partitions 100, 500, 1000, 5000, 10000); y-axis: time (sec.); series: Heapsort (HW), Heapsort Config (HW), Heapsort (SW), Bubblesort (HW), Bubblesort Config (HW), Radixsort (SW).]

Page 19

Results – Bubblesort vs. Radixsort

• Some cases where HW is faster than SW: list sizes < 5000, thanks to SRC pipelined data access
• Fastest SW case was for list size = 10000

[Chart: Radixsort (SW) vs. Bubblesort (HW). X-axis: data size/test case (100, 500, 1000, 5000, 10000); y-axis: time (sec.); series: HW – Data Transfer Out, HW – Data Processing, HW – Data Transfer In, SW – Only.]

• MAP data transfer time is less significant than data processing time
• For size = 1000: input (11.3%), analyze (76.9%), output (11.5%)

Page 20

Results – Limitations

• Heapsort is limited by the overhead of input servicing: random accesses of OBM are not ideal; overhead of loop search and sequentially dependent processing
• Bubblesort is limited by the number of cells: can increase by approximately 13 cells; two-chip streaming
• Reconfiguration time assumed to be a one-time setup factor; the reconfiguration case is the exception: solve by having a core per V26000

Page 21

Conclusions

• Pipelined, systolic designs are needed to overcome the speed advantage of the microprocessor
• Bubblesort works well on small data sets
• Heapsort's random data access cannot exploit SRC benefits
• SRC's high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs

Page 22

Future Work

• Heapsort's random data access cannot exploit SRC benefits; look for possible speedups using BRAM: unroll leaf memory access, exploit the SRC "periodic macro" paradigm
• Currently evaluating radix sort in hardware; this works better than bubblesort for larger sort sizes
• Compare MAP C to VHDL when baseline VHDL is faster than SW