load balancing switch
Post on 24-Feb-2016
1
LOAD BALANCING SWITCH
By: Maxim Fudim, Oleg Schtofenmaher
Supervisor: Walter Isaschar
PROJECT POSTER
Winter - Spring 2008
2
Abstract
Software solutions for real-time are too slow
Power dissipation limits work frequencies
Greater computing power needed
H/W accelerators may improve S/W processes
Multi-core, multi-threaded systems are the future
3
Project Goals
Multiprocessor environment for parallel processing of a vector data stream
Maximal throughput
Configurable hardware
Expandable design
Statistics report
4
System specifications
SW over transparent HW
Interface over PCI
1 Mbit/sec input stream
Vectors of 8 to 1024 chunks
Variable number of processors
System spreads over multiple FPGAs
5
Problem
How to manage the data stream?
How to manage multiple parallel units?
How to achieve full and effective utilization of resources?
6
Solution (Top Level)
Board Level Load Balancing Switch
One system input and output to PCI
Distribute vectors among classes
Local buffers for chip data
7
Solution (Chip Level)
Chip Level Load Balancing Switch
Converting shared resources to “personal” work space
Cluster-organized VPUs
Monitoring of each unit’s load
Smart arbitration
Flexible and easy configuration
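The chip-level idea of monitoring each unit's load and arbitrating on it can be illustrated in software. The following is a minimal Python sketch, not the project's hardware design: the names `Vpu`, `pick_least_loaded`, and `dispatch` are hypothetical, and the load metric (chunks waiting in a unit's private queue) is an assumption.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Vpu:
    """Hypothetical software stand-in for one vector processing unit."""
    name: str
    queue: deque = field(default_factory=deque)  # the unit's "personal" work space

    @property
    def load(self) -> int:
        # Assumed load metric: total chunks waiting in this unit's queue.
        return sum(len(vector) for vector in self.queue)


def pick_least_loaded(vpus):
    # min() returns the first minimal element, so ties go to the earliest unit.
    return min(vpus, key=lambda v: v.load)


def dispatch(vector, vpus):
    """Send one vector to the currently least-loaded unit and report its name."""
    target = pick_least_loaded(vpus)
    target.queue.append(vector)
    return target.name


vpus = [Vpu("vpu0"), Vpu("vpu1"), Vpu("vpu2")]
# Four vectors of 8, 2, 4 and 3 chunks arrive in that order.
order = [dispatch(v, vpus) for v in ([1] * 8, [1] * 2, [1] * 4, [1] * 3)]
print(order)
```

In this sketch the fourth vector goes to `vpu1` rather than back to `vpu0`, because `vpu0` is still holding the 8-chunk vector.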
8
Solution - Tree Distribution Switch
[Diagram: the SW/HW interface feeds a Class of Service Distribution stage, which fans out to LBS Arbitration blocks; each LBS Arbitration block feeds several Clusters of VPUs.]
9
Three level Architecture
Provide level for packets management (Classes): Type, Size, Priority of Data
Provide level for organizing various processing units (Clusters): Speed, Quantity, Resources of Processors
Provide level for fine tuning (VPUs): Algorithm, HW accelerating
10
Implementation
Multi chip system connected over two busses
Input and controls over Main Bus
Output via streamed neighboring buses
Local FIFOs for every chip/class
Classifier for packet management
SW configurable controls
Cluster-organized VPUs with in/out arbitration
Watchdogs & Statistics Gathering
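The classifier and per-class local FIFOs can be pictured with a small software model. This is only a sketch under stated assumptions: the poster says classes are defined by type, size, and priority of data, while the size-only rule, the class names, and the thresholds in `CLASS_LIMITS` below are invented for illustration.

```python
from collections import deque

# Assumed class boundaries by vector length in chunks (hypothetical values).
CLASS_LIMITS = [(16, "short"), (256, "medium"), (1024, "long")]


def classify(length_chunks: int) -> str:
    """Return the first class whose upper limit covers the vector length."""
    for limit, name in CLASS_LIMITS:
        if length_chunks <= limit:
            return name
    raise ValueError("vector longer than the 1024-chunk maximum")


# One local FIFO per class, mirroring "Local FIFOs for every chip/class".
fifos = {name: deque() for _, name in CLASS_LIMITS}


def enqueue(vector_id: int, length_chunks: int) -> str:
    """Classify a vector and append its ID to that class's FIFO."""
    cls = classify(length_chunks)
    fifos[cls].append(vector_id)
    return cls


routed = [enqueue(i, n) for i, n in enumerate([8, 200, 1000, 16, 17])]
print(routed)
```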
11
Board Level diagram
[Diagram labels: S/W emulator or H/W DSP system; input vectors; output reports; PCI Bus; PROCStar II; LBS1 to LBS4, each on a Stratix II 180 with DDR2 memories; Classifier; Main Bus: Data In and Controls; Ring Bus; per-LBS registers.]
12
Single FPGA Top Diagram
[Diagram labels: Stratix II 180 FPGA; Load Balancing Switch (LBS); LBS 1-4; DDR2 Controls Bank A; DDR2 Controls Bank B; I/O - LBS Control Block; Bus Control Block; data flow; 16 NIOS clusters.]
13
Data Packet Format
Packet layout: Header | Data (1 to N 32-bit words) | Tail
Header fields: Unused, Nios Number, Data Length N, Vector ID/Command Type, Version (4-bit), SW/HW Control (1-bit), Type (1-bit, Data/Command)
Other field widths shown in the figure: 8-bit, 32-bit, 16-bit
Tail: Sync Data
14
LBS Class Top Level View
[Diagram labels: Stratix II FPGA; Main Controller unit; Input Reader; Output Writer; FIFO Input Port; FIFO Output Port; three Cluster Arbiters, each with a NIOS II System; Control; Input data bus; Muxed output data bus; Control and Status; Statistics Reporter; Busses Control Block.]
15
Organization of VPUs (Vector Processing Units)
NIOS VPUs joined into clusters
Constant number of clusters
Parametric number of NIOS VPUs in cluster
Parametric control & distribution logic
Various configurations of NIOS
Static/Dynamic Priority Arbitration
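To make the difference between static and rotating priority concrete, here is a small Python model of two arbiters. It is a sketch under assumptions, not the project's arbitration logic: `static_priority_grant` and `RoundRobinArbiter` are hypothetical names, and round-robin is used here as one simple example of a priority that changes over time.

```python
def static_priority_grant(requests):
    """Grant the lowest-index requester; index 0 has the highest fixed priority."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None  # nobody is requesting


class RoundRobinArbiter:
    """Rotating priority: the search starts just after the last granted index."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # so that index 0 is checked first

    def grant(self, requests):
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if requests[i]:
                self.last = i
                return i
        return None  # nobody is requesting


arb = RoundRobinArbiter(3)
reqs = [True, False, True]  # units 0 and 2 request continuously
rr_grants = [arb.grant(reqs) for _ in range(4)]
static_grants = [static_priority_grant(reqs) for _ in range(4)]
print(rr_grants, static_grants)
```

With both units requesting continuously, the static scheme always serves unit 0, while the rotating scheme alternates between units 0 and 2.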
16
LBS Units Description
VPUs: NIOS System
Single processor with in/out buffers
HW accelerated system
Shared resources system with mutex
Multi-processor system with number of ports to Cluster
Resource Usage
Module | Logic utilization | % | Memory (M4K) | %
Peripheral IPs (MegaFIFO, PLLs, etc.) | 3,100 | 2 | 16 | 2
User System (All VPUs + LBS) | 42,000 | 30 | 675 | 88
Single VPU | 6,775 | 4.7 | 112 | 15
LBS Logic | 1,350 | 1 | 3 | 0.5
Total usage of chip resources | 45,896 | 32 | 691 | 90
Total available | 143,000 | 100 | 768 | 100
Resource usage data for 6 VPU system
VPU resource usage is based on basic VPUs and may be decreased by advanced configurations and policies.
18
Performance of LBS
Theoretical throughput: 100 MHz x 64 bit = 6.4 Gbit/s
Arbitration and routing latency: 2-4 cycles on average
60% effective bandwidth utilization for short vectors, up to 98% for long vectors
1 Mbit/s to 400 Mbit/s real throughput
PCI and slow algorithms = bottlenecks
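The throughput figures on this slide follow from simple arithmetic, worked through below in Python. The clock, bus width, and utilization percentages are taken from the slide; the variable and function names are illustrative only.

```python
CLOCK_HZ = 100e6      # 100 MHz, from the slide
BUS_WIDTH_BITS = 64   # 64-bit data path, from the slide

# Peak rate if one 64-bit word moves on every clock cycle.
theoretical_bps = CLOCK_HZ * BUS_WIDTH_BITS


def effective_gbps(utilization: float) -> float:
    """Bandwidth actually carrying data at a given utilization, in Gbit/s."""
    return theoretical_bps * utilization / 1e9


theoretical_gbps = theoretical_bps / 1e9
short_gbps = effective_gbps(0.60)  # short vectors: 60% utilization
long_gbps = effective_gbps(0.98)   # long vectors: up to 98% utilization
print(theoretical_gbps, short_gbps, long_gbps)
```

Even at 60% utilization the switch could carry about 3.84 Gbit/s, far above the measured 400 Mbit/s maximum, which is consistent with the slide naming PCI and slow algorithms as the bottlenecks.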
19
Performance for short vectors
System | Time of Service [sec] | Throughput [Mbit/s] | Impr
SW (on Core2Duo E6600) | 0.1 | 3.2 | -
6 VPUs | 0.00209 | 122 | 38
2 Classes of 6 VPUs | 0.00134 | 191 | 60
3 Classes of 6 VPUs | 0.00086 | 297 | 93
4 Classes of 6 VPUs | 0.00064 | 400 | 125
Time and throughput for 1000 vectors of 4 chunks each
VPU performance is based on basic VPUs and round-robin (RR) arbitration and may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.
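The improvement column for short vectors can be reproduced from the throughput figures. The Python sketch below assumes that "Impr" means hardware throughput divided by the software baseline, rounded to the nearest integer; the slide does not state the formula, but the short-vector values match this reading.

```python
SW_BASELINE_MBIT_S = 3.2  # software on Core2Duo E6600, from the table

# Hardware throughput in Mbit/s for 1000 vectors of 4 chunks each, from the table.
hw_throughput_mbit_s = {
    "6 VPUs": 122,
    "2 Classes of 6 VPUs": 191,
    "3 Classes of 6 VPUs": 297,
    "4 Classes of 6 VPUs": 400,
}

# Assumed definition: improvement = hardware throughput / software throughput.
improvement = {
    system: round(mbit_s / SW_BASELINE_MBIT_S)
    for system, mbit_s in hw_throughput_mbit_s.items()
}
print(improvement)
```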
20
Performance for medium vectors
System | Time of Service [sec] | Throughput [Mbit/s] | Impr
SW (on Core2Duo E6600) | 2.9 | 2.3 | -
6 VPUs | 0.28 | 23.4 | 10
2 Classes of 6 VPUs | 0.15 | 43.5 | 18.5
3 Classes of 6 VPUs | 0.01 | 66 | 28.7
4 Classes of 6 VPUs | 0.074 | 88 | 38
Time and throughput for 1000 vectors of 200 chunks each
VPU performance is based on basic VPUs and round-robin (RR) arbitration and may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.
21
Performance for long vectors
System | Time of Service [sec] | Throughput [Mbit/s] | Impr
SW (on Core2Duo E6600) | 1.1 | 2.9 | -
One VPU | 1.224 | 2.62 | 0.89
6 VPUs | 0.208 | 15.43 | 5.3
2 Classes of 6 VPUs | 0.11 | 29.1 | 10
3 Classes of 6 VPUs | 0.074 | 43.69 | 14.8
4 Classes of 6 VPUs | 0.061 | 52.46 | 18
Time and throughput for 100 vectors of 1000 chunks each
VPU performance is based on basic VPUs and round-robin (RR) arbitration and may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.
22
System performance – missing TOAs
[Chart "Missing TOAs": processing time [sec] (axis 0.000 to 0.500) versus number of missing TOAs (axis 0 to 20), with Hardware and Software series.]
23
System performance – noise levels
[Chart "Noise percentage": processing time [sec] (axis 0.000 to 0.600) versus noise [%] (axis 0 to 50), with Hardware and Software series.]