Allen Michalski
CSE Department – Reconfigurable Computing Lab, University of South Carolina

Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer


Page 1

Allen Michalski
CSE Department – Reconfigurable Computing Lab
University of South Carolina

Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer

Page 2

MAPLD 2005/253 – Michalski

Outline

• Reconfigurable Computing – Introduction
• SRC-6e architecture, programming model
• Sorting Algorithms
• Design guidelines
• Testing Procedures, Results
• Conclusions, Future Work
• Lessons learned

Page 3

What is a Reconfigurable Computer?

Combination of:
• Microprocessor workstation for front-end processing
• FPGA backend for specialized coprocessing
• Typical PC bus for communications

Page 4

What is a Reconfigurable Computer?

PC Characteristics
• High clock speed
• Superscalar, pipelined
• Out-of-order issue
• Speculative execution
• High-level language programming

FPGA Characteristics
• Low clock speed
• Large number of configurable elements: LUTs, Block RAMs, CPAs, multipliers
• HDL programming

Page 5

What is the SRC-6e?

SRC = Seymour R. Cray. A reconfigurable computer with a high-throughput memory interface:
• 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads
• PCI-X (1.0) = 1.064 GB/s

Page 6

SRC-6e Development

• Programming does not require knowledge of hardware design
• C code can compile to hardware

Page 7

SRC Design Objectives

FPGA Considerations
• Superscalar design: parallel, pipelined execution

SRC Considerations
• High overall data throughput: streaming versus non-streaming data transfer?
• Reduction of FPGA data-processing stalls due to data dependencies and data read/write delays: FPGA Block RAM versus SRC OnBoard Memory?
• Evaluate software/hardware partitioning: algorithm partitioning, data-size partitioning

Page 8

Sorting Algorithms

Traditional Algorithms
• Comparison sorts: Θ(n lg n) best case
  – Insertion sort, merge sort, heapsort, quicksort
• Counting sorts
  – Radix sort: Θ(d(n+k))

• HPCS FORTRAN code baseline: radix sort in combination with heapsort
• This research focuses on 128-bit operands, which simplified SRC data transfer and management

Page 9

Sorting – SRC FPGA Implementation

Memory Constraints
• SRC onboard memory: 6 banks × 4 MB; pipelined read or write access; 5-clock latency
• FPGA BRAM memory: 144 blocks, 18 Kbit each; 1-clock read and write latency

Initial Choices
• Parallel insertion sort (BubbleSort): produces sorted blocks; uses onboard-memory pipelined processing to minimize data-access stalls
• Parallel heapsort: random-access merge of sorted lists; uses BRAM for low-latency access, good for random data access

Page 10

Parallel Insertion Sort (BubbleSort)

• Systolic array of cells
• Pipelined SRC processing from OnBoard Memory
• Each cell keeps the highest value and passes the other values on
• Latency is 2x the number of cells

Page 11

Parallel Insertion Sort (BubbleSort)

• Systolic array of cells
• Results passed out in reverse order of comparison (N = number of comparator cells)
• Sorts a list of size L completely in Θ(L²)
• Limit sort size to some number a < L (the list size):
  – Create multiple sorted lists
  – Each list sorted in Θ(a)

Page 12

Parallel Insertion Sort (BubbleSort)

#include <libmap.h>

void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno)
{
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

    /* DMA the input from common memory into OnBoard Memory */
    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
    wait_DMA(0);
    ....
    while (arrayindex < arraysize) {
        endarrayindex = arrayindex + sortsize - 1;
        if (endarrayindex > arraysize - 1)
            endarrayindex = arraysize - 1;

        while (arrayindex < endarrayindex) {
            for (i = arrayindex; i <= endarrayindex; i++) {
                data_high_in = a[i];
                data_low_in  = b[i];

                /* one pass through the systolic comparator array */
                parsort(i == endarrayindex, data_high_in, data_low_in,
                        &data_high_out, &data_low_out);

                c[i] = data_high_out;
                d[i] = data_low_out;
    ....

Page 13

Parallel Heapsort

• Tree structure of cells
• Asynchronous operation with acknowledged data transfer
• Merges sorted lists in Θ(n lg n)
• Designed for independent BRAM block accesses

Page 14

Parallel Heapsort

BRAM Limitations
• 144 Block RAMs at 512 × 32-bit values each: not a whole lot of 128-bit values

OnBoard Memory
• SRC constraint: up to 64 reads and 8 writes in one MAP C file
• Cascading clock delays as the number of reads increases
• Explore the use of muxed access: search and update only 6 of the 48 leaf nodes at a time, in round-robin fashion

Page 15

FPGA Initial Results

Baseline: one V26000; PAR options: -ol high -t 1

Bubblesort Results – 100 Cells
• 29,354 slices (86%); 37,131 LUTs (54%)
• 13.608 ns = 73 MHz (verified operational at 100 MHz)

Heapsort Results – 95 Cells (48 Leafs)
• 21,011 slices (62%); 24,467 LUTs (36%)
• 11.770 ns = 85 MHz (verified operational at 100 MHz)

Page 16

Testing Procedures

• All tests utilize one chip for baseline results
• Evaluate the fastest software radix of operation
• Hardware/software partitioning: five cases; case 5 utilizes FPGA reconfiguration
• Data-size partitioning: 100, 500, 1000, 5000, 10000
• 10 runs for each test case/data partitioning combination
• List size: 500,000 values

Page 17

Results

Fastest Software Operations (Baseline)
• Comparison of radixsort and heapsort combinations: radix 4, 8, and 16 evaluated
• Minimum time: radix-8 radixsort + heapsort (size = 5000 or 10000)
• Radix-16 has too many buckets for the sort-size partitions evaluated
• Heapsort comparisons are faster than radixsort index updates

[Chart: Software Datasize Partitioning – Radixsort vs. Radixsort + Heapsort. X-axis: test case/radix (4, 8, 16) for Radixsort alone and Radix + Heap at list sizes 100, 500, 1000, 5000, 10000; y-axis: time (sec.); series: HeapSort, RadixSort.]

Page 18

Results

• Fastest SW-only time = 3.41 sec.
• Fastest time including HW = 3.89 sec.: Bubblesort (HW), Heapsort (SW), partition list size of 1000
• Heapsort times are dominated by data access and significantly slower than software

[Chart: SRC Software/Hardware Executions (500K data). X-axis: data partition/test case (S-S, H-S, S-H, H-H at partitions 100, 500, 1000, 5000, 10000); y-axis: time (sec.); series: Heapsort (HW), Heapsort Config (HW), Heapsort (SW), Bubblesort (HW), Bubblesort Config (HW), Radixsort (SW).]

Page 19

Results – Bubblesort vs. Radixsort

• Some cases where HW is faster than SW: list sizes < 5000, thanks to SRC pipelined data access
• Fastest SW case was for list size = 10000

[Chart: Radixsort (SW) vs. Bubblesort (HW). X-axis: data size/test case (100, 500, 1000, 5000, 10000); y-axis: time (sec.); series: HW – Data Transfer Out, HW – Data Processing, HW – Data Transfer In, SW – Only.]

• MAP data transfer time is less significant than data processing time
• For size = 1000: input (11.3%), analyze (76.9%), output (11.5%)

Page 20

Results – Limitations

• Heapsort is limited by the overhead of input servicing: random accesses of OBM are not ideal; overhead of loop search and sequentially dependent processing
• Bubblesort is limited by the number of cells: can increase by approximately 13 cells; two-chip streaming
• Reconfiguration time assumed to be a one-time setup factor; the reconfiguration case is the exception: solve by having a core per V26000

Page 21

Conclusions

• Pipelined, systolic designs are needed to overcome the speed advantage of the microprocessor
• Bubblesort works well on small data sets
• Heapsort's random data access cannot exploit SRC benefits
• SRC's high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs

Page 22

Future Work

• Heapsort's random data access cannot exploit SRC benefits; look for possible speedups using BRAM: unroll leaf memory access, exploit the SRC "periodic macro" paradigm
• Currently evaluating radix sort in hardware; this works better than bubblesort for larger sort sizes
• Compare MAP C to VHDL when baseline VHDL is faster than SW