IEEE Workshop on HSLN, 16 Nov 2004
Simulative Analysis of the RapidIO Embedded Interconnect Architecture for Real-Time, Network-Intensive Applications
David Bueno, Adam Leko, Chris Conger,
Ian Troxel, and Alan D. George
High-Performance Networking (HPN) Group
HCS Research Laboratory
University of Florida
Presentation Outline
  Introduction
  Background
    RapidIO Technology Overview
    Ground Moving Target Indicator (GMTI) Overview
  Simulation and Modeling Environment
  Experimental Setup
  Results
  Conclusions
Introduction
  Big impetus to provide more processing power on-board satellites
    More powerful radiation-hardened components available
    Strive to reduce downlink requirements
  Today’s satellite systems typically built around an expensive custom or COTS bus interconnect (e.g., cPCI, VME)
    Scalability, bandwidth, and latency limitations
    Solutions tend to be “one-off” designs with much non-recurring engineering
  High on-board data rate requirements and desire to reduce custom design make COTS-based embedded networks an attractive solution
  RapidIO (RIO) a leading contender
    High-performance, switched embedded interconnect
    Scales to connect many nodes
    Better bisection bandwidth than bus-based technologies
    Less “hand-coded” synchronization and arbitration required
  Example: A 64-bit, 33 MHz cPCI bus provides ~2 Gbps of throughput, while a single 8-bit DDR 250 MHz RapidIO endpoint provides ~4 Gbps of throughput. Even a modest RIO system of such links provides tens of gigabits per second of aggregate throughput with many non-blocking links.
  Image courtesy http://www.afa.org
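The peak figures quoted in the example can be reproduced with simple arithmetic: bus width times clock rate, doubled for DDR signaling. The sketch below is only a back-of-the-envelope check; the function name `peak_gbps` and the eight-endpoint system size are assumptions made for illustration and do not come from the slides.

```python
def peak_gbps(width_bits, clock_mhz, ddr=False):
    """Raw peak throughput in Gb/s, ignoring protocol overhead."""
    transfers_per_clock = 2 if ddr else 1
    return width_bits * clock_mhz * transfers_per_clock / 1000.0


# 64-bit, 33 MHz cPCI bus (shared by every device on the bus)
cpci = peak_gbps(64, 33)

# One 8-bit, 250 MHz DDR RapidIO endpoint link
rio_link = peak_gbps(8, 250, ddr=True)

# Hypothetical system of 8 such endpoints on a non-blocking fabric
rio_aggregate = 8 * rio_link

print(f"cPCI bus:       {cpci:.3f} Gb/s")
print(f"RapidIO link:   {rio_link:.3f} Gb/s")
print(f"8 RIO links:    {rio_aggregate:.3f} Gb/s aggregate")
```

The bus figure is shared among all attached devices, whereas each switched link contributes its own bandwidth, which is why the aggregate grows with node count.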
Background: RapidIO
  Relatively new technology, with limited research to date
    White paper out January 2002
    First specification document published June 2002
  A set of formal specifications, published by RapidIO Trade Organization (RTO)
  Support from many companies
    Motorola, IBM, TI, Xilinx, Lucent, Agere, Analog Devices, Ericsson, Altera, among others
  Several are offering products
    Xilinx, Motorola, Redswitch, and Praesum
RapidIO Specification Documents c/o RTO

  Document                  Description
  Multicast Specification   Defines a method for RIO switch-based multicast
  Streaming Specification   Defines a method for protocol-independent encapsulation of payloads up to 64K bytes
  system bringup spec.pdf   Provides standard approaches for RapidIO system bring-up, device enumeration, routing table management, software and hardware abstraction layers and APIs
  fcspec.pdf                Flow control spec. rev1.0
  errata1.pdf               Revision 1.2 Errata 1
  hipspec.pdf               HIP doc. rev1.0
  errspec.pdf               Error spec. rev1.2
  RapidIO.pdf               Main spec. rev1.2
  serial.book.pdf           Serial spec. rev1.2
  inter-op.pdf              Inter-operability spec. rev1.2
  oview.pdf                 Spec. overview rev1.2
  gsmlspec.pdf              GSM spec. rev1.2
Background: RapidIO
  Three-layered, embedded system interconnect architecture
    Logical: memory-mapped I/O, message passing, and global shared memory
    Transport: routing based on packet destination ID
    Physical: serial and 8- or 16-bit parallel at 250, 500, or 1000 MHz
  Point-to-point, packet-switched interconnect
  Targeted for inter-processor and inter-board embedded interconnects
  Peak single-link throughput ranging from 2 to 64 Gb/s
  Focus on 16-bit parallel LVDS RIO implementation for satellite systems
  Image courtesy G. Shippen, “RapidIO Technical Deep Dive 1: Architecture & Protocol,” Motorola Smart Network Developers Forum, 2003.
Background: RapidIO
  Uses Low-Voltage Differential Signaling (LVDS) to minimize power
  Employs fabrics in form of Multistage Interconnection Networks (MINs) to allow communication between arbitrary devices
  Two types of packets: control and data
  Message-passing logical layer
    Provides traditional message-passing interface with mailbox-style delivery
    Request and response messages between endpoints
    Supports 26 message priorities with segmenting up to 4096 B
  [Figure: (a) Receiver- and (b) transmitter-controlled flow control. In (a), the transmitter sends Writes 0 to 4; the receiver acknowledges 0 and 1 and issues a retry for 2; the transmitter then resends Writes 2 to 4, which are acknowledged. In (b), each acknowledgment or idle symbol from the receiver reports the number of buffers available.]
  Physical layer
    Only supports 4 priority levels
    Error detection
      Supported directly via CRC for regular packets
      Inverted bitwise replication of symbols for control packets
    Error recovery accomplished via Go-Back-N sliding window retransmission of damaged packets
    Transmitter or receiver flow control supported at link level
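The Go-Back-N behavior described above (and the retry sequence in the receiver-controlled flow-control figure) can be sketched in a few lines. This is a simplified illustration, not the simulator's model: the function name `go_back_n`, the window size, and the rule that a packet is rejected at most once are all assumptions.

```python
def go_back_n(num_packets, window, reject_once):
    """Return the order in which packet IDs are placed on the link.

    reject_once: IDs the receiver refuses (retry or error) the first
    time they arrive. After a rejection, the receiver discards every
    later packet of that burst, so the sender resumes from the
    rejected packet.
    """
    pending_reject = set(reject_once)
    sent_log = []
    base = 0  # oldest packet not yet acknowledged

    while base < num_packets:
        # Sender transmits up to `window` packets without waiting for acks.
        burst = list(range(base, min(base + window, num_packets)))
        sent_log.extend(burst)

        # Receiver handles the burst in order.
        for pkt in burst:
            if pkt in pending_reject:
                pending_reject.discard(pkt)  # accepted on the next attempt
                break                        # the rest of the burst is dropped
            base = pkt + 1                   # acknowledged
    return sent_log


# Writes 0..4 with a retry on packet 2, as in the flow-control figure
print(go_back_n(num_packets=5, window=5, reject_once={2}))
```

With transmitter-controlled flow control, the sender would consult the advertised buffer count before each burst and so avoid the wasted transmissions of packets 3 and 4.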
Background: GMTI
  Space-based RADAR: GMTI detects and tracks moving targets on ground
    Important use in military applications
    Typified by large data sets and high computation requirements
  Algorithm decomposed into multiple sub-tasks
  Incoming data set viewed as 3-dimensional “data cube”
    Size of each cube dictated by Coherent Processing Interval (CPI)
    Each task has an ideal dimension for partitioning and processing
      If partitioned along optimum dimension for a particular task, no inter-processor communication necessary during processing of that task
      Data reorganized in-between tasks if necessary by performing a corner-turn
  Size of resulting data is orders of magnitude smaller than incoming data
    Completing processing on-board greatly reduces amount of downlink throughput required from satellite to Earth
  [Figure: GMTI algorithm flow and processing task breakdown: Receive Cube, Pulse Compression, Doppler Processing, Space-Time Adaptive Processing (STAP), Constant False Alarm Rate (CFAR), Send Results, with a corner turn where the partitioning changes between the range and pulse dimensions]
  [Figure: Data cube dimensions: beams, ranges, pulses]
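A corner turn is a distributed transpose: data partitioned along one dimension is redistributed so that each processor owns complete vectors along another. The sketch below works on a 2-D slice of the cube in plain Python lists. The helper names `split` and `corner_turn` are hypothetical, and nothing here reflects how the simulated processors actually exchange data.

```python
def split(n, parts):
    """Contiguous index ranges dividing n items among `parts` owners."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append(range(start, stop))
        start = stop
    return bounds


def corner_turn(row_slabs, num_cols):
    """Redistribute row-partitioned data into column-partitioned data.

    row_slabs[p] holds the rows owned by PE p (each row has num_cols items).
    Returns col_slabs[q]: the complete columns owned by PE q.
    """
    pes = len(row_slabs)
    col_owner = split(num_cols, pes)

    # All-to-all personalized exchange: PE src sends PE dst only the
    # part of its rows that falls in dst's column range.
    messages = [
        [[[row[c] for c in col_owner[dst]] for row in row_slabs[src]]
         for dst in range(pes)]
        for src in range(pes)
    ]

    col_slabs = []
    for dst in range(pes):
        # Stack the received blocks in source order, then transpose so
        # each entry is one full-length column.
        rows = [r for src in range(pes) for r in messages[src][dst]]
        col_slabs.append([list(col) for col in zip(*rows)])
    return col_slabs


# 4 x 4 slice, value = 10 * row + column, split across 2 PEs by rows
matrix = [[10 * r + c for c in range(4)] for r in range(4)]
before = [matrix[0:2], matrix[2:4]]
after = corner_turn(before, num_cols=4)
print(after[0])  # columns 0 and 1, complete
print(after[1])  # columns 2 and 3, complete
```

Every PE sends a distinct block to every other PE, which is why the corner turn stresses bisection bandwidth.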
GMTI: Parallel Partitioning
  Straightforward partitioning
    Entire system works in parallel on a single data cube
    All-to-all personalized communication used to perform corner-turn
    Result latency must be ≤ 1 CPI
  Staggered partitioning
    Processors work in small groups, one data cube for each group
    Incoming data cubes are sent to groups via round-robin distribution
    Result latency must be ≤ N × CPI (N = number of processor groups)
  Pipelined partitioning
    Processors work in small groups, each group responsible for one stage of algorithm
    Corner turns “for free”
    Result latency must be ≤ N × CPI (N = number of stages)
  [Figure: Timing diagrams of straightforward, staggered, and pipelined partitioning for four processing elements (PE #1 to PE #4) across the PC, DP, STAP, and CFAR stages, with Data Cubes 0 to 5 arriving over CPI 0 to CPI 4]
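The latency bounds for the three partitioning schemes reduce to a small rule, sketched below together with the round-robin assignment used by the staggered scheme. The 256 ms CPI is the deadline quoted later in the deck; the function names and the group count in the usage lines are assumptions for illustration.

```python
CPI_MS = 256  # real-time CPI quoted on the System Throughput slide


def latency_deadline_ms(scheme, n=1, cpi_ms=CPI_MS):
    """Maximum allowed result latency for a partitioning scheme.

    n is the number of processor groups (staggered) or the number of
    pipeline stages (pipelined); it is ignored for straightforward.
    """
    if scheme == "straightforward":
        return cpi_ms
    if scheme in ("staggered", "pipelined"):
        return n * cpi_ms
    raise ValueError(f"unknown scheme: {scheme}")


def staggered_group(cube_index, num_groups):
    """Round-robin choice of the group that receives a given data cube."""
    return cube_index % num_groups


print(latency_deadline_ms("straightforward"))   # one CPI
print(latency_deadline_ms("staggered", n=3))    # 3 groups (hypothetical)
print(latency_deadline_ms("pipelined", n=4))    # PC, DP, STAP, CFAR stages
print([staggered_group(i, 3) for i in range(6)])
```

A longer deadline does not mean lower throughput: in the staggered and pipelined schemes a new cube still enters the system every CPI.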
Simulation & Modeling Environment
  Modeling library created using Mission Level Designer (MLD), a commercial discrete-event simulation modeling tool from MLDesign Technologies
    C++-based, block-level, hierarchical modeling tool
  Algorithm modeling accomplished via custom C++ primitives
    Created different processor models for different phases of the algorithm
    Processor model approximates vector DSP processor
  Our model library includes:
    RIO central-memory switch
    Compute node with RIO endpoint
    GMTI traffic source/sink
    RIO logical message-passing layer
    Transport and parallel physical layers
  [Figure: Model of compute node with RIO endpoint]
RapidIO Models
  Key features of endpoint model
    Message-passing logical layer
    Transport layer
    Parallel physical layer
      Transmitter- and receiver-controlled flow control
      Error detection and recovery
      Priority scheme for buffer management
      Adjustable link speed and width
      Adjustable priority thresholds and queue lengths
  Key features of central-memory switch model
    Selectable cut-through or store-and-forward routing
    TDM model for memory access (approximated with average delay)
    Adjustable priority thresholds based on free switch memory
    Adjustable link rates, etc., similar to endpoint model
  [Figure: Model of RIO central-memory switch]
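The difference between the two routing modes of the switch model can be illustrated with the textbook first-order latency formulas. This is not the MLD switch model: it ignores memory-access delay, contention, and flow control, and the packet size, header size, and hop count in the usage lines are hypothetical.

```python
def serialization_ns(num_bytes, link_gbps):
    """Time to clock num_bytes onto a link; bits / (Gb/s) gives ns."""
    return num_bytes * 8 / link_gbps


def store_and_forward_ns(packet_bytes, switches, link_gbps):
    """Each switch receives the whole packet before forwarding it,
    so the packet is serialized once per link (switches + 1 links)."""
    return (switches + 1) * serialization_ns(packet_bytes, link_gbps)


def cut_through_ns(packet_bytes, header_bytes, switches, link_gbps):
    """Each switch forwards as soon as the routing header has arrived,
    so only the header time is paid per switch."""
    return (serialization_ns(packet_bytes, link_gbps)
            + switches * serialization_ns(header_bytes, link_gbps))


# Hypothetical case: 256-byte packet, 8-byte header, 2 switches, 8 Gb/s links
sf = store_and_forward_ns(256, switches=2, link_gbps=8)
ct = cut_through_ns(256, header_bytes=8, switches=2, link_gbps=8)
print(f"store-and-forward: {sf:.0f} ns, cut-through: {ct:.0f} ns")
```

The gap widens with hop count and packet size, but it affects per-packet latency only, not the sustained bandwidth available on a link.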
System Models
  High throughput requirements for data source and data redistribution in pipelined partitioning require non-blocking connectivity between all nodes and data sources
  Custom network topologies created for 8-, 12-, 16-, and 24-processor systems
  Network topologies favored communication patterns of pipelined partitioning scheme
    Most communication-intensive partitioning scheme
    Algorithm performance of other schemes not sensitive to topologies
  24-node topology shown below (others similar)
  [Figure: 24-node topology with data source. Grey = switch, red = pulse compression node, blue = Doppler node, green = STAP node, orange = CFAR node]
Simulation Experiment Setup
  Built system models with 8, 12, 16, and 24 compute nodes
  For each experiment, 8 CPIs' worth of data sent to processors and processed
  Key simulation parameters
    16-bit parallel RapidIO
    250 MHz DDR clock rate
    4.6 Gb/s incoming GMTI data rate
    10 KB switch central memory size
    Cut-through routing on/off
    Transmitter- or receiver-controlled flow control
  Key simulation outputs
    CPI completion latency
    Average packet latency
    System and application bandwidth
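A few derived quantities follow directly from the parameters listed above. The sketch below computes them under two loudly stated assumptions: that the data source streams continuously at the quoted rate, and that one data cube arrives per 256 ms CPI (the deadline given on the System Throughput slide). The variable names are illustrative and are not simulator settings.

```python
LINK_WIDTH_BITS = 16       # 16-bit parallel RapidIO
CLOCK_MHZ = 250            # DDR clock rate
INPUT_RATE_GBPS = 4.6      # incoming GMTI data rate
CPI_SECONDS = 0.256        # assumed: one cube per 256 ms CPI
CPIS_PER_EXPERIMENT = 8

# Raw link rate: two transfers per clock with DDR signaling
link_gbps = LINK_WIDTH_BITS * CLOCK_MHZ * 2 / 1000.0

# Fraction of one link's raw rate consumed by the incoming data stream
input_share = INPUT_RATE_GBPS / link_gbps

# Data delivered per CPI and per experiment, in bytes
bytes_per_cpi = INPUT_RATE_GBPS * 1e9 * CPI_SECONDS / 8
bytes_per_experiment = bytes_per_cpi * CPIS_PER_EXPERIMENT

print(f"raw link rate:        {link_gbps:.1f} Gb/s")
print(f"input share of link:  {input_share:.1%}")
print(f"data per CPI:         {bytes_per_cpi / 1e6:.1f} MB")
print(f"data per experiment:  {bytes_per_experiment / 1e6:.1f} MB")
```

Each cube is several orders of magnitude larger than the 10 KB switch memory, so sustained traffic has to stream through the fabric rather than be buffered in it.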
System Bandwidth Measurements
  Overall system b/w = total bytes transferred ÷ total simulated time
  Application b/w = total payload transferred ÷ total simulated time
  Pipelined method requires most redistribution of data, consumes most bandwidth
  Straightforward and staggered methods are comparable
  Gap between pipelined bars indicates communication inefficiencies
  [Figure: Bar chart of overall system bandwidth and application bandwidth; y-axis: data rate (Gbps), 0 to 12]
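The two bandwidth definitions above translate directly into code. The sketch assumes a log of transfers recorded as (payload bytes, total bytes on the wire) pairs; that log format and the sample numbers are invented for illustration and are not simulator output.

```python
def bandwidths_gbps(transfers, simulated_seconds):
    """Return (overall system b/w, application b/w) in Gb/s.

    transfers: iterable of (payload_bytes, total_bytes) per packet,
    where total_bytes includes headers and padding.
    """
    transfers = list(transfers)
    total_bytes = sum(total for _, total in transfers)
    payload_bytes = sum(payload for payload, _ in transfers)
    overall = total_bytes * 8 / simulated_seconds / 1e9
    application = payload_bytes * 8 / simulated_seconds / 1e9
    return overall, application


# Hypothetical log: 1000 packets, 256 B payload, 272 B on the wire, in 1 ms
overall_bw, app_bw = bandwidths_gbps([(256, 272)] * 1000, 1e-3)
print(f"overall: {overall_bw:.3f} Gb/s, application: {app_bw:.3f} Gb/s")
```

The gap between the two values is exactly the overhead traffic, which is what the paired bars in the chart make visible.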
Packet Overhead Efficiency
  Communication efficiency = total payload transferred ÷ total bytes transferred
  Pipelined method is consistently least efficient (as indicated on previous slide)
    Data groupings of pipelined method require many packets to be sent that are < RIO max of 256 bytes
    Packets that are not even multiples of a 32B word must also be “padded” with dummy data
  [Figure: Communication efficiency chart; y-axis: communication efficiency, 0.8 to 1]
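The effect of small packets and padding on efficiency can be shown with a toy packetizer. The 256-byte maximum payload comes from the slide, and the padding granularity of 32 is taken at face value from the slide's "32B"; the 16-byte per-packet overhead is purely a placeholder assumption, as are the function names.

```python
import math


def packetize(message_bytes, max_payload=256, pad_to=32, overhead=16):
    """Split a message into packets.

    Returns a list of (payload_bytes, bytes_on_wire) pairs. Each payload
    is padded up to a multiple of pad_to, then `overhead` bytes of
    header/trailer are added (placeholder value, not from the spec).
    """
    packets = []
    remaining = message_bytes
    while remaining > 0:
        payload = min(remaining, max_payload)
        padded = math.ceil(payload / pad_to) * pad_to
        packets.append((payload, padded + overhead))
        remaining -= payload
    return packets


def efficiency(message_sizes, **kwargs):
    """Communication efficiency = total payload / total bytes transferred."""
    packets = [p for size in message_sizes for p in packetize(size, **kwargs)]
    payload = sum(p for p, _ in packets)
    total = sum(t for _, t in packets)
    return payload / total


large = efficiency([1024])        # four full 256-byte packets
small = efficiency([100] * 10)    # ten short, padded packets
print(f"large messages: {large:.3f}, small messages: {small:.3f}")
```

Many short messages pay the fixed overhead and the padding on every packet, which is consistent with the pipelined method's lower measured efficiency.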
RIO Fabric Considerations
  Cut-through routing
    Provided reduced packet latencies
    Did not improve performance of overall application
      CPI completion latency remains same
      GMTI is bandwidth-intensive but not sensitive to latency of individual packets
  Flow-control method
    Transmitter-controlled flow control did not provide improvements over receiver-controlled baseline method
    Tx-controlled flow control does eliminate packet transmission attempts when receiver buffers unable to accept (at no performance cost over Rx-controlled flow control)
      Could save power in some systems
  [Figure: Average packet delay (ns), 0 to 7000, for baseline and cut-through routing across the pipelined, straightforward, and staggered methods on 8-, 12-, 16-, and 24-node systems]
System Throughput
  24-node systems needed to meet real-time deadline (256 msec CPI, red line)
  As modeled, pipelined method performs worst
    True benefit of pipelining comes from stringing together smaller, cheaper specialized processing elements
    If this can be done in implementation, cost/performance benefits can be gained
  Staggered method can sometimes leave processors idle, reducing throughput
  Straightforward method also has best CPI latency, since all PEs work on each CPI
  [Figure: System throughput chart; y-axis: CPIs/second, 0 to 4.5]
Recent Results, Additional Experiments
  Using much larger input data set sizes and more efficient system layouts
    24 Gbps vs. 4.6 Gbps input data set used here
    Still using 28-node system
  Result latency comparison
    Interval from input data arrival to reporting of results
    Recall that deadline to meet is 256 ms
      This deadline is extended to a multiple of 256 ms for staggered and pipelined methods
    Some communication-computation overlap is acceptable assuming DMA
  Free switch-memory histograms
    Show percent of time switch spent with different amounts of free memory
    Reveal congestion or confirm efficient routing
    Spikes or bumps in low free memory brackets imply contention for a particular port
    Histograms generated for every switch in the system
  Future research
    Our RapidIO research is an on-going effort
    Upcoming studies will consider different logical layers
      Memory-mapped logical layer already being modeled
    In addition to GMTI, Synthetic Aperture Radar (SAR) to be simulated and studied in RapidIO-based systems
  [Figure: Free switch memory histogram; x-axis: free memory (bytes), 0 to 13107.2; y-axis: frequency, 0 to 0.5]
  [Figure: Result latency comparison; x-axis: number of ranges, 32000 to 64000; y-axis: latency (ms), 0 to 1536; series: Straightforward, 5 boards; Staggered, 5 boards; Pipelined, 6 boards; Pipelined, 7 boards]
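The free switch-memory histograms described on this slide are time-weighted: each bin records the fraction of simulated time the switch spent with that much memory free. A minimal sketch follows, assuming the switch model emits a (time, free bytes) event whenever its occupancy changes. The event format, the function name, and the 16384-byte capacity with 5 bins (chosen only to match the 3276.8-byte spacing of the axis ticks) are all assumptions.

```python
def free_memory_histogram(events, end_time, capacity, num_bins):
    """Fraction of time spent in each free-memory bin.

    events: list of (time, free_bytes) sorted by time; a value holds
    until the next event (or until end_time for the last one).
    """
    bin_width = capacity / num_bins
    time_in_bin = [0.0] * num_bins

    next_events = events[1:] + [(end_time, None)]
    for (t, free), (t_next, _) in zip(events, next_events):
        # Clamp so that free == capacity lands in the top bin.
        idx = min(int(free / bin_width), num_bins - 1)
        time_in_bin[idx] += t_next - t

    total = end_time - events[0][0]
    return [t / total for t in time_in_bin]


# Hypothetical trace: mostly empty switch with one brief congestion episode
trace = [(0, 16384), (2, 1000), (3, 16384)]
hist = free_memory_histogram(trace, end_time=10, capacity=16384, num_bins=5)
print(hist)
```

A healthy switch concentrates its mass in the highest bin; weight in the lowest bins is the signature of port contention mentioned above.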
Conclusions
  RapidIO provides feasible path to flight for space-based radar
    Throughput capability and interconnect scalability of RapidIO provide sufficient infrastructure for compute-intensive applications
    Future work to focus on additional SBR variants (e.g., Synthetic Aperture Radar, SAR) and experimental RIO analysis
  Developed suite of simulation models and mechanisms for evaluation of RapidIO designs for space-based radar and other applications
    Flexibility in system design using RapidIO interconnect allows range of system topologies to support various algorithm partitionings
    Straightforward method provides lowest completion latencies; pipelined method suffers from some communication inefficiencies
    Recent work shows systems scalable to more nodes and larger data cube sizes with greater processing/network requirements
    GMTI result latency does not benefit from cut-through routing or from selection of either Rx- or Tx-controlled flow control
      Other applications may benefit more from these features of RapidIO
      Flow control method may offer other benefits, such as lower power consumption
Acknowledgements
This research was funded in part by Honeywell Defense and Space Electronic Systems (DSES), Clearwater, FL.
Thanks are also extended to MLDesign Technologies in Palo Alto, CA for use of their MLD software tools.