1
LOAD BALANCING SWITCH
Final presentation for project
Spring 2008 (Part B)
By: Maxim Fudim, Oleg Schtofenmaher
Supervisor: Walter Isaschar
2
General overview
- Software solutions for real-time processing are too slow
- Power dissipation limits working frequencies
- Greater computing power is needed
- H/W accelerators can improve S/W processes
- Multi-core, multi-threaded systems are the future
3
Project Goals
- Multiprocessor environment for parallel processing of a vector data stream
- Maximal throughput
- Configurable hardware
- Expandable design
- Statistics report
4
System specifications
- SW over transparent HW
- Interface over PCI
- 1 Mbit/sec input stream
- Vectors of 8 to 1024 chunks
- Variable number of processors
- System spreads over multiple FPGAs
5
Problem
- How to manage the data stream?
- How to manage multiple parallel units?
- How to achieve full and effective utilization of resources?
6
Solution (Top Level)
Board Level Load Balancing Switch
- One system input and output to PCI
- Distribute vectors among classes
- Local buffers for chip data
7
Solution (Chip Level)
Chip Level Load Balancing Switch
- Converting shared resources to a “personal” work space
- VPUs organized in clusters
- Monitoring of each unit’s load
- Smart arbitration
- Flexible and easy configuration
8
Solution - Tree Distribution Switch
[Diagram: the SW/HW interface feeds the Class of Service Distribution stage, which fans out to several LBS Arbitration blocks; each LBS Arbitration block serves its own clusters of VPUs.]
9
Three-level Architecture
- Classes: a level for packet management (type, size, priority of data)
- Clusters: a level for organizing various processing units (speed, quantity, resources of processors)
- VPUs: a level for fine tuning (algorithm, HW acceleration)
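The three levels above can be pictured as a small tree. The following is a minimal sketch only, not the project's implementation: the class names, the `dispatch` function and the "most free VPUs" selection rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VPU:
    """Lowest level: one vector processing unit."""
    name: str
    busy: bool = False


@dataclass
class Cluster:
    """Middle level: a group of VPUs behind one arbiter."""
    name: str
    vpus: list = field(default_factory=list)

    def free_vpus(self):
        return [v for v in self.vpus if not v.busy]


@dataclass
class ServiceClass:
    """Top level: one class of service holding several clusters."""
    name: str
    clusters: list = field(default_factory=list)


def dispatch(classes, class_name):
    """Route one vector: select the class, then the cluster with the most
    free VPUs (assumed policy), then the first free VPU in that cluster.
    Returns the VPU name, or None when the whole class is busy."""
    cls = classes[class_name]
    candidates = [c for c in cls.clusters if c.free_vpus()]
    if not candidates:
        return None
    # On a tie, max() keeps the first cluster in list order.
    cluster = max(candidates, key=lambda c: len(c.free_vpus()))
    vpu = cluster.free_vpus()[0]
    vpu.busy = True
    return vpu.name


classes = {
    "A": ServiceClass("A", [
        Cluster("A0", [VPU("A0.0"), VPU("A0.1")]),
        Cluster("A1", [VPU("A1.0")]),
    ]),
}
first = dispatch(classes, "A")
second = dispatch(classes, "A")
third = dispatch(classes, "A")
fourth = dispatch(classes, "A")  # every VPU is busy by now
```

Changing the policy at one level (for example, the cluster selection rule) leaves the other two levels untouched, which is the point of the layering.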
10
Implementation
11
Board Level
- Multi-chip system
- Local FIFOs for every chip/class
- Classifier for packet management
- SW-configurable controls
- Input and controls over the Main Bus
- Output via streamed neighboring busses
12
Board Overview
13
Busses Description
14
Board Level diagram
[Diagram: a S/W emulator or H/W DSP system exchanges input vectors and output reports with the PROCStar II board over the PCI bus. The board carries four Stratix II 180 FPGAs (LBS1 with the Classifier, LBS2, LBS3, LBS4), each with DDR2 banks and per-LBS registers. The Main Bus carries data in and controls; Ring Bus segments link the chips.]
15
Single Chip diagram
[Diagram: a Stratix II FPGA holding the Load Balancing Switch (LBS), four NIOS VPUs and the Bus Control Block. DDR2 A serves as FIFO IN and DDR2 B as FIFO OUT. Main Bus: input vectors, data and controls. Left Bus: muxed reports. Right Bus: reports.]
16
PCI-System Interfaces
Software - Hardware Interface:
- Input and Output MultiFIFO PCI data bus
- MultiFIFO status
LBS 1-4 Interface:
- 2x32-bit general-purpose read registers
- 2x32-bit general-purpose write registers
- 8-bit information register
- Software reset signal
17
PCI-System Interfaces
Classifier:
- Global Configuration Register (32 bit)
- Global Info Register (32 bit)
- Global In Count Register (32 bit)
- Global Out Count Register (32 bit)
- Global Active Time Register (32 bit)
- Global Software reset signal
18
Board Level Description
Classifier (board level):
- Distributes data from the input PCI to the Local FIFOs
- Handles demands from the Local Output Masters
- Synchronizes data and controls
- Configurable arbitration between LBS classes
- Configurable statistics gathering
- Timeout mechanism
19
Board Level Description
Busses Control Block (on every chip):
- Parametric pin numbering
- Main/Ring bus routing
- Data sampling
- FIFO management
- Local Grant controls
- Local Output FIFO master
20
Main Bus Interfaces
Input Data & Control Interface:
- Input data bus to Local FIFOs
- ACK from Local FIFOs
- REQ to Local FIFOs
- Statistics REQ
21
Main Bus Interfaces
Output Controls Interface:
- Demand from Local FIFO Masters
- Output Grant
- ACK from PCI FIFO
- End of vector from PCI FIFO Master
22
RING Bus Interfaces
Output Data Interface:
- Output data bus from Local FIFOs
- Data Valid from Local Masters
- End of output vector from Local Masters
- Statistics Data
- Statistics Valid
23
Chip Level
- Local FIFO for inputs/outputs
- Internal clusters configuration
- Arbitration, priorities
- Statistics, synchronization
24
Single FPGA Top Diagram
[Diagram: a Stratix II 180 FPGA (one of LBS 1-4) with the Load Balancing Switch (LBS) serving 16 NIOS clusters. DDR2 controls for Bank A and Bank B, an I/O - LBS Control Block and a Bus Control Block surround the switch; arrows mark the data flow.]
25
Input System Interface
LBS Input Interface:
- 64-bit data bus from the Input MultiFIFO
- Read request and ack. signals
- MultiFIFO status flags
- SW/HW input signals
26
Output System Interface
LBS Output Interface:
- 64-bit data bus to the Output MultiFIFO
- Write request and ack. signals
- MultiFIFO status flags
- SW/HW input signals
27
Data Packet Format
Packet: Header | Data: 1 to N 32-bit words | Tail
Header fields: Unused, NIOS Number, Data Length N, Vector ID/Command, Version (4-bit), SW/HW Control (1-bit), Type (1-bit, Data/Command)
Other field widths shown in the figure: 8-bit, 32-bit, 16-bit
Tail: Sync Data
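The framing described on this slide (a header, N 32-bit data words, a sync tail) can be sketched in software as follows. This is an illustration under stated assumptions, not the project's format definition: the 64-bit header width, the bit positions, the assignment of the 8/16/32-bit widths to the NIOS number, data length and vector ID, and the `SYNC_TAIL` value are all hypothetical.

```python
import struct

SYNC_TAIL = 0xA5A5A5A5A5A5A5A5  # hypothetical sync pattern, not from the slides


def pack_header(ptype, sw_ctrl, version, nios, length, vector_id):
    """Pack the header into one 64-bit word. Assumed layout, MSB first:
    2 unused | type 1 | SW/HW control 1 | version 4 | NIOS number 8 |
    data length 16 | vector ID/command 32."""
    word = (ptype & 0x1) << 61
    word |= (sw_ctrl & 0x1) << 60
    word |= (version & 0xF) << 56
    word |= (nios & 0xFF) << 48
    word |= (length & 0xFFFF) << 32
    word |= vector_id & 0xFFFFFFFF
    return word


def unpack_header(word):
    """Inverse of pack_header under the same assumed layout."""
    return {
        "ptype": (word >> 61) & 0x1,
        "sw_ctrl": (word >> 60) & 0x1,
        "version": (word >> 56) & 0xF,
        "nios": (word >> 48) & 0xFF,
        "length": (word >> 32) & 0xFFFF,
        "vector_id": word & 0xFFFFFFFF,
    }


def build_packet(header_fields, data_words):
    """Serialize header, N 32-bit data words and the sync tail (big-endian)."""
    header = pack_header(length=len(data_words), **header_fields)
    fmt = f">Q{len(data_words)}IQ"
    return struct.pack(fmt, header, *data_words, SYNC_TAIL)


def parse_packet(raw):
    """Parse one packet; raise ValueError when the sync tail is missing."""
    (header,) = struct.unpack_from(">Q", raw, 0)
    fields = unpack_header(header)
    n = fields["length"]
    data = list(struct.unpack_from(f">{n}I", raw, 8))
    (tail,) = struct.unpack_from(">Q", raw, 8 + 4 * n)
    if tail != SYNC_TAIL:
        raise ValueError("sync tail not found")
    return fields, data


packet = build_packet(
    {"ptype": 0, "sw_ctrl": 1, "version": 2, "nios": 5, "vector_id": 77},
    [10, 20, 30],
)
fields, data = parse_packet(packet)
```

Carrying the data length in the header lets the reader locate the tail without scanning, and the tail check gives a cheap way to detect a lost or extra word.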
28
LBS Top Level View
[Diagram: inside the Stratix II FPGA, the FIFO Input Port (from PCI) feeds the Input Reader, which drives the input data bus to the Cluster Arbiters, each in front of a NIOS II System. A muxed output data bus returns results to the Output Writer and the FIFO Output Port. The Main Controller unit exchanges control and status signals with these blocks, alongside the Statistics Reporter.]
29
Organization of VPUs (Vector Processing Units)
- NIOS VPUs joined into clusters
- Constant number of clusters
- Parametric number of NIOS VPUs per cluster
- Parametric control logic
- Variable configuration of NIOS
- Different priorities for different clusters
30
NIOS Input Interface
Hardware:
- 64-bit input data bus (from LBS)
- 10-bit data slices counter (from LBS)
- Write request signal (from LBS)
- Chip select signal (from LBS)
- NIOS ready signal (from NIOS)
- Data ready signal (from LBS)
31
NIOS Output Interface
Hardware:
- 64-bit output data bus (from NIOS)
- 7-bit data slices counter (from LBS)
- Read request signal (from LBS)
- Chip select signal (from LBS)
- Output ready signal (from NIOS)
- Output taken signal (from LBS)
32
Twin VPU System
Input / Output waveform
33
LBS Units Description: Input Reader
- Reading data from the input FIFO
- Writing data to the selected cluster
- Providing header control bits for the main controller
- Synchronization checks
- Vector length counter
- Input time stamp
34
LBS Units Description: Sync Flusher
- Flush data on input error
- Look for the Sync Tail
- Parametric number of recovery tries
- Failure signal to the Error Reporter
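The flush-and-resynchronize behaviour listed for the Sync Flusher can be sketched as below. This is a software illustration under assumptions, not the hardware design: the `SYNC_TAIL` value, the function name, and the idea that one recovery try scans a fixed window of words are all hypothetical.

```python
SYNC_TAIL = 0xA5A5A5A5A5A5A5A5  # hypothetical sync word, not from the slides


def flush_to_sync(words, max_tries=3, window=4):
    """Discard words from an iterator until the sync tail is seen.

    Each recovery try scans at most `window` words (assumed meaning of a
    "try"). Returns (recovered, flushed_count). A False result stands in
    for the failure signal sent to the Error Reporter.
    """
    flushed = 0
    for _ in range(max_tries):
        for _ in range(window):
            word = next(words, None)
            if word is None:
                return False, flushed  # stream ended before resync
            flushed += 1
            if word == SYNC_TAIL:
                return True, flushed   # next word starts a fresh packet
    return False, flushed


stream = iter([1, 2, 3, SYNC_TAIL, 99])
recovered, flushed = flush_to_sync(stream)
next_word = next(stream)  # first word after the sync tail

failed, failed_count = flush_to_sync(iter([7] * 20), max_tries=2, window=4)
```

Bounding the number of tries keeps a corrupted stream from stalling the input path indefinitely.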
35
Input Reader Diagram
36
LBS Units Description: Input Controller - FSM
37
LBS Units Description: Output Writer
- Reading data from the selected cluster
- Writing data to the output FIFO
- Vector length counter
- Output time stamp
38
Output Writer Diagram
39
LBS Units Description: Output Controller - FSM
40
LBS Units Description: Main Controller
- Enabling input and output units
- Selecting the control source (S/W or H/W)
- Monitoring clusters’ load via status buses
- Selecting clusters for input/output operations
- Data validity indication
41
Main Controller
Status Decoders
42
LBS Units Description: MC Status Algorithm
- Independent decoders for input and output status
- Static priority
- Dynamic load
- Parametric aging mechanism
- Round robin within the same priority group
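The combination of static priority, dynamic load, aging and round robin listed above can be sketched as a small arbiter. This is one possible reading, not the algorithm used in the project: the `Arbiter` class, the rule "effective priority = static priority + age // aging_step", and the convention that a higher number means more urgent are assumptions.

```python
class Arbiter:
    """Priority arbiter with aging and round robin inside a priority group."""

    def __init__(self, priorities, aging_step=4):
        self.priorities = list(priorities)   # static priority per port
        self.aging_step = aging_step         # waits needed to gain one level
        self.age = [0] * len(self.priorities)
        self.last = -1                       # last served port

    def effective(self, port):
        # Aging raises the priority of a port that keeps waiting.
        return self.priorities[port] + self.age[port] // self.aging_step

    def select(self, ready):
        """ready: one bool per port (the dynamic load status).
        Returns the chosen port index, or None if no port is ready."""
        n = len(self.priorities)
        active = [p for p in range(n) if ready[p]]
        if not active:
            return None
        top = max(self.effective(p) for p in active)
        group = [p for p in active if self.effective(p) == top]
        # Round robin: the first group member after the last served port.
        winner = min(group, key=lambda p: (p - self.last - 1) % n)
        for p in active:
            self.age[p] = 0 if p == winner else self.age[p] + 1
        self.last = winner
        return winner


arb = Arbiter([1, 1, 0], aging_step=2)
order = [arb.select([True, True, True]) for _ in range(4)]
idle = Arbiter([1, 1]).select([False, False])
```

In this run ports 0 and 1 alternate first; port 2, with the lower static priority, is served once aging lifts it into the same group, so it cannot starve.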
43
LBS Units Description: MC Status Algorithm
[Diagram: the status input passes through dynamic port mapping and static priority/aging mapping; round robin (RR) over the active ports then selects the next port.]
44
Decoding Flow
45
LBS Units Description: Statistics Reporter
- Monitoring system activity
- Error reporting for software
- Counting processed vectors
- Throughput = Vectors served / Time of service
46
LBS Units Description: Clusters Load Reporter
- Monitoring clusters’ activity
- Per-VPU active/free status
- Sending status on request from the Classifier
47
LBS Units Description: Error Reporting
- Input Reader error
- Recovery synchronization failure
- Packet drops
- Local Output Master’s error
- LBS activity
48
LBS Units Description: Cluster Entity
- Parametric enabling of clusters/VPUs
- Cluster/VPU status reporters
- VPU flow controllers
- Watchdogs
- NIOS systems
49
LBS Units Description: Cluster Configs
- Define the quantity of VPU ports
- Define the type of VPUs in the cluster
- Automatic creation of per-VPU control logic
- Parametric arbiter for input/output data
50
LBS Units Description: Per NIOS Structure
51
LBS Units Description: VPU Controller
- Input: 4-phase REQ/ACK protocol with NIOS (NIOS Ready, Data Ready)
- Output: 4-phase REQ/ACK protocol with NIOS (Output Ready, Output Taken)
- Smart Status Reporter
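A 4-phase REQ/ACK handshake, as named for the VPU Controller, moves one item per four signal transitions. The sketch below models the generic protocol only; how REQ and ACK map onto the NIOS Ready / Data Ready and Output Ready / Output Taken signals is not stated here, so the function and signal names are illustrative assumptions.

```python
def four_phase_transfer(words):
    """Transfer words one at a time with a 4-phase handshake.

    Returns the received words and the trace of (req, ack) levels
    after every transition.
    """
    trace, received = [], []
    req = ack = 0
    for word in words:
        # Phase 1: sender places data and raises REQ.
        req = 1
        trace.append((req, ack))
        # Phase 2: receiver latches the data and raises ACK.
        received.append(word)
        ack = 1
        trace.append((req, ack))
        # Phase 3: sender sees ACK and drops REQ.
        req = 0
        trace.append((req, ack))
        # Phase 4: receiver drops ACK; both sides are idle again.
        ack = 0
        trace.append((req, ack))
    return received, trace


received, trace = four_phase_transfer([10, 20])
```

Because both signals return to zero after every word, neither side needs to know the other's speed, which suits VPUs whose processing time varies.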
52
LBS Units Description: VPU Controller
53
VPU Input FSM
54
VPU Output FSM
55
LBS Units Description: General NIOS System
- Single processor with in/out buffers
- HW accelerated system
- Shared resources system with mutex
- Multi-processor system with a number of ports to the Cluster
56
LBS Units Description: Rafael’s basic NIOS System
SOPC components:
- Nios II with custom algorithm
- Program memory
- Input Vector and Output Vector buffers
- HW Accelerator
- Timer
57
LBS Units Description: Dummy NIOS System
Stub components:
- Input Vector Buffer
- Output Vector Buffer
- Flow management
- Parametric delay for performance analysis
62
Resources & Performance
63
Resource Usage
Module                                  Logic utilization   %     Memory (M4K)   %
Peripheral IPs (MegaFIFO, PLLs, etc.)   3,100               2     16             2
User System (All VPUs + LBS)            42,000              30    675            88
Single VPU                              6,775               4.7   112            15
LBS Logic                               1,350               1     3              0.5
Total usage of chip resources           45,896              32    691            90
Total available                         143,000             100   768            100
Resource usage data for a 6-VPU system.
VPU resource usage is based on basic VPUs and may be decreased by advanced configurations and policies.
64
Performance of LBS
- Theoretical throughput: 100 MHz x 64 bit = 6.4 Gbit/s
- Arbitration and routing latency: 2-4 cycles on average
- 60% throughput for short vectors, up to 95% for long vectors
- PCI and slow algorithms are the bottlenecks
- 1 Mbit/s to 400 Mbit/s real throughput
65
Performance for short vectors
System                   Time of service [sec]   Throughput [Mbit/s]   Improvement (x)
SW (on Core2Duo E6600)   0.1                     3.2
6 VPUs                   0.00209                 122                   38
2 Classes of 6 VPUs      0.00134                 191                   60
3 Classes of 6 VPUs      0.00086                 297                   93
4 Classes of 6 VPUs      0.00064                 400                   125
Time and throughput for 1000 vectors of 4 chunks each.
VPU performance is based on basic VPUs and RR arbitration; it may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.
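The throughput and improvement figures for short vectors can be related to the time of service with a short calculation. This is a sketch under assumptions: the 32-bit chunk size follows the 32-bit data words of the packet format, while the 128 bits of per-vector framing (header plus tail) is a guess chosen because it reproduces the reported numbers; the function names are illustrative.

```python
def throughput_mbit(vectors, chunks_per_vector, time_s,
                    chunk_bits=32, framing_bits=128):
    """Bits moved divided by time of service, in Mbit/s.

    framing_bits is an assumed per-vector header + tail overhead.
    """
    bits = vectors * (chunks_per_vector * chunk_bits + framing_bits)
    return bits / time_s / 1e6


def improvement(hw_mbit, sw_mbit):
    """Improvement factor as the ratio of hardware to software throughput."""
    return hw_mbit / sw_mbit


six_vpus = throughput_mbit(1000, 4, 0.00209)      # about 122 Mbit/s
four_classes = throughput_mbit(1000, 4, 0.00064)  # about 400 Mbit/s
gain = improvement(400, 3.2)                      # about 125
```

Under these assumptions the improvement column matches the ratio of the throughput columns rather than the ratio of the rounded service times.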
66
Performance for medium vectors
System                   Time of service [sec]   Throughput [Mbit/s]   Improvement (x)
SW (on Core2Duo E6600)   2.9                     2.3
6 VPUs                   0.28                    23.4                  10
2 Classes of 6 VPUs      0.15                    43.5                  18.5
3 Classes of 6 VPUs      0.01                    66                    28.7
4 Classes of 6 VPUs      0.074                   88                    38
Time and throughput for 1000 vectors of 200 chunks each.
VPU performance is based on basic VPUs and RR arbitration; it may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.
67
Performance for long vectors
System                   Time of service [sec]   Throughput [Mbit/s]   Improvement (x)
SW (on Core2Duo E6600)   1.1                     2.9
One VPU                  1.224                   2.62                  0.89
6 VPUs                   0.208                   15.43                 5.3
2 Classes of 6 VPUs      0.11                    29.1                  10
3 Classes of 6 VPUs      0.074                   43.69                 14.8
4 Classes of 6 VPUs      0.061                   52.46                 18
Time and throughput for 100 vectors of 1000 chunks each.
VPU performance is based on basic VPUs and RR arbitration; it may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.
68
System performance - missing TOAs
[Plot: processing time [sec] (axis from 0.085 to 0.115) versus number of missing TOAs (axis from 0 to 25).]
69
System performance - noise levels
[Plot: processing time [sec] (axis from 0.094 to 0.184) versus noise [%] (axis from 0 to 60).]
70
Tasks (part A)
- Study relevant tools and environments (GiDEL PROCWizard, API, Quartus, STP…) - Done
- Define interfaces with other groups - Done
- Define basic algorithm for H/W switching - Done
- Implement and debug the switch - Done
- Develop stubs for testing - Done
- Expand design for several NIOS systems - Done
- Integration with NIOS system - Done
- SW test application for operating and integration with hardware design - Done
71
Tasks (part B)
- Increase number of NIOS systems in clusters - Done
- Improve algorithm for priority cluster selection - Done
- Expand statistics reports - Done
- Expand SW/HW communication - Done
- Add error correction/handling - Done
- Spread design to several FPGAs - Done
- Complete integration with relevant projects - Done
72
Summary
- Flexible architecture
- LBS with various SW/HW controls and statistics
- Up to 4 chips x 16 clusters x 32 VPUs per system
- Fully functional S/W - Board - LBS - NIOS interfaces
- Successful hardware and software integration
- Working design examples for other teams
73
Conclusions
- Tree Switch concept is simple and efficient
- Three-layer abstraction concept = minimized changes
- Buffers for every Class = independence
- SAE for every VPU = balancing and performance
- Single level of mastering = minimized resources
- 64-bit buses = maximized throughput
- Single data interface to SW = bottleneck for high-speed designs
74
Conclusions (cont.)
- Main bottlenecks are the PCI bus and the VPU algorithm
- Throughput varies between 1 Mbit/s and 400 Mbit/s (vector dependent)
- The design complies with the requirements
- Further improvements in the algorithm will speed up the system and allow a larger number of VPUs