
Reconfigurable Computing: A First Look at the Cray-XD1

Craig Ulmer

Mitch Sukalski, David Thompson, Rob Armstrong, Curtis Janssen, and Matt Leininger

Orgs: 8961 & 8963

September 1, 2004

Outline

• Reconfigurable computing refresher
  – Progress update

• Cray XD1
  – Architecture
  – General message passing
  – Reconfigurable Computing and the XD1

Reconfigurable Computing Update

Reconfigurable Computing

• Use reconfigurable devices (FPGAs) to implement key computations directly in hardware

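double doX(double *a, int n) {
    int i;
    double x;

    x = 0;
    /* Each iteration consumes one triple; stop before reading past a[n-1]. */
    for (i = 0; i + 2 < n; i += 3) {
        x += a[i] * a[i+1] + a[i+2];
        /* … */
    }
    /* … */
    return x;
}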

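[Figure: dataflow circuit for the loop body. A multiplier computes a[i] * a[i+1], an adder adds a[i+2], and a second adder accumulates the result into x through a one-cycle delay register (Z^-1).]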

First Year Progress

• Computation (Underwood, SNL/NM)
  – Double-precision floating-point cores

• Communication
  – Multi-gigabit transceiver (MGT) interface
  – Gigabit Ethernet work

• Early application experiments
  – Simplified isosurfacing
  – Networked pattern matching

Peak Floating-Point Performance

SP = single precision, DP = double precision; cores = number of cores that fit in one V2P100-6.

Core           | SP speed | SP cores | SP peak     | DP speed | DP cores | DP peak
Addition       | 195 MHz  | 89       | 17 GFLOPS   | 143 MHz  | 40       | 5.7 GFLOPS
Multiplication | 176 MHz  | 74       | 13 GFLOPS   | 142 MHz  | 27       | 3.8 GFLOPS
Division       | 120 MHz  | 22       | 2.6 GFLOPS  | 98 MHz   | 6        | 0.58 GFLOPS

Source: Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,” FPGA ’04.

Connecting FPGAs to the Network Fabric

• Modern FPGAs feature multi-gigabit transceivers
  – Experimented with GigE, Myrinet 2000, and IB
  – Implemented a TCP Offload Engine (TOE) in hardware
  – Working on OpenTOE and OpenGigE cores

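[Figure: block diagram of the SNL_OpenGigE and SNL_OpenTOE cores. Blocks shown: Rocket I/O MGT (GT_Ethernet_2) with MGT control; Tx and Rx paths with MAC framer, align, pad, and CRC generate/check; IP header, ARP ping, ARP reply, ARP cache, ping reply, and decode; TCP sequence generator, ACK monitor, timeout monitor, and incoming/outgoing data queues; TCP and socket interfaces.]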

Cray XD1 Overview

NDA Notice

• We do have an NDA with Cray Canada
• The XD1 we have on loan is an early beta system

Cray XD1 Overview

• Dense MP system
  – 12 AMD Opterons on 6 blades
  – 6 Xilinx Virtex-II Pro FPGAs
  – InfiniBand-like interconnect
  – 6 SATA hard drives
  – 4 PCI-X slots
  – 3U rack

Individual Blade

[Figure: block diagram of an individual blade. Components: two Opterons with DDR memory, RAP network interfaces, the “Einstein” FPGA chip, and links to the RapidArray fabric (24 4x IB ports). Link rates shown: HT 6.4 GB/s, HT 3.2 GB/s, FPGA “HT” 3.2 GB/s, 4x IB 2 GB/s.]

* All data rates are aggregates (i.e., 3.2 GB/s = 1.6 GB/s + 1.6 GB/s)

Message Passing

• MPICH 1.2.5
  – Latency: 2.25 μs
  – Bandwidth: 1.3 GB/s (82% of HT-IB link)

• RapidArray message layer
  – Open source
  – MP, RDMA
  – Global address space

[Figure: MPI Bandwidth. Bandwidth (million bytes/s, 0 to 1,800) vs. message size (bytes, 1 to 10,000,000, log scale), with labels “PCI-X 133” and “1.6GB/s HT”.]

System Administration

• Active manager
  – Synchronize each node’s OS
  – Partition blade functionality
  – Control access rights

• Embedded processor
  – Monitors health (heartbeats)
  – Can restart nodes

• Issues?

Reconfigurable Computing and the Cray XD1

Connecting to the “Einstein” Accelerator

[Figure: connecting the “Einstein” FPGA. Labels shown: RAP NI with host HT, network IB, FPGA port, and fabric port; the FPGA containing user-defined circuits and an HT I/F; two links of 1.6 + 1.6 GB/s; four QDR2 I/F blocks, each attached to 2 MB of SRAM.]

Example: Random Number Generator

• Monte Carlo app in need of good random numbers
  – Mersenne twister

• Implemented in FPGA
  – FPGA pushes to host memory
  – 301 vs. 101 million integers/s
  – ~1.2 GB/s

[Figure: the RNG in the FPGA pushes numbers through the NI into host memory, where the CPU consumes them.]

General XD1 Comments

• Reconfigurable computing
  – FPGA in memory
  – Fast local memory

• Other accelerators
  – ClearSpeed

• Global address space
  – Opteron limits (40-bit physical addresses)

• Vendor lock-in
  – Incompatible network
  – All-in-one box?

• Current NI is a bottleneck

• Density vs. Reliability

• Value-added features

Good Not-so-Good

Friendly Users?

• We have a month left on evaluation
  – Could use feedback from other users

http://cdulmer.ran.sandia.gov/xd1 cdulmer@sandia.gov
