LOAD BALANCING SWITCH

By: Maxim Fudim, Oleg Schtofenmaher
Supervisor: Walter Isaschar
Final presentation for project, Spring 2008 (Part B)


Page 2: General overview

- Software solutions for real-time are too slow
- Power dissipation limits work frequencies
- Greater computing power needed
- H/W accelerators can improve S/W processes
- Multi-core, multi-threaded systems are the future

Page 3: Project Goals

- Multiprocessor environment for parallel processing of a stream of data vectors
- Maximal throughput
- Configurable hardware
- Expandable design
- Statistics report

Page 4: System specifications

- SW over transparent HW
- Interface over PCI
- 1 Mbit/sec input stream
- Vectors of 8 to 1024 chunks
- Variable number of processors
- System spreads over multiple FPGAs

Page 5: Problem

- How to manage the data stream?
- How to manage multiple parallel units?
- How to achieve full and effective utilization of resources?

Page 6: Solution (Top Level)

- Board Level Load Balancing Switch
- One system input and output to PCI
- Distribute vectors among classes
- Local buffers for chip data

Page 7: Solution (Chip Level)

- Chip Level Load Balancing Switch
- Converting shared resources to “personal” work space
- VPUs organized in clusters
- Monitoring of each unit’s load
- Smart arbitration
- Flexible and easy configuration

Page 8: Solution - Tree Distribution Switch

[Diagram: SW/HW interface; Class of Service Distribution; three LBS Arbitration blocks, each with its own clusters of VPUs.]

Page 9: Three level Architecture

- Provide a level for packet management (Classes): type, size, priority of data
- Provide a level for organizing various processing units (Clusters): speed, quantity, resources of processors
- Provide a level for fine tuning (VPUs): algorithm, HW acceleration

Page 10: Implementation

Page 11: Board Level

- Multi-chip system
- Local FIFOs for every chip/class
- Classifier for packet management
- SW configurable controls
- Input and controls over the Main Bus
- Output streamed via buses between neighboring chips

Page 12: Board Overview

Page 13: Busses Description

Page 14: Board Level diagram

[Diagram: a S/W emulator or H/W DSP system exchanges input vectors and output reports with a PROCStar II board over the PCI bus. The board carries four Stratix II 180 FPGAs (LBS1 with the Classifier, LBS2, LBS3, LBS4) with DDR2 memories and per-LBS registers. Main Bus: data in and controls. Ring Bus between the chips.]

Page 15: Single Chip diagram

[Diagram: a Stratix II FPGA containing the Load Balancing Switch (LBS), four NIOS VPUs, a Bus Control Block, DDR2 A (FIFO IN) and DDR2 B (FIFO OUT). Main Bus: input vectors, data and controls. Left Bus: muxed reports. Right Bus: reports.]

Page 16: PCI-System Interfaces

Software - Hardware Interface:
- Input and Output MultiFIFO PCI data bus
- MultiFIFO status

LBS 1-4 Interface:
- 2x32-bit general purpose read registers
- 2x32-bit general purpose write registers
- 8-bit information register
- Software reset signal

Page 17: PCI-System Interfaces

Classifier:
- Global Configuration Register (32 bit)
- Global Info Register (32 bit)
- Global In Count Register (32 bit)
- Global Out Count Register (32 bit)
- Global Active Time Register (32 bit)
- Global Software reset signal

Page 18: Board Level Description

Classifier (board level):
- Distributes data from Input PCI to Local FIFOs
- Handles demands from Local Output Masters
- Synchronizes data and controls
- Configurable arbitration between LBS classes
- Configurable statistics gathering
- Timeout mechanism

Page 19: Board Level Description

Busses Control Block (on every chip):
- Parametric pins numbering
- Main/Ring Busses routing
- Data sampling
- FIFO management
- Local Grant controls
- Local Output FIFO master

Page 20: Main Bus Interfaces

Input Data & Control Interface:
- Input data bus to Local FIFOs
- ACK from Local FIFOs
- REQ to Local FIFOs
- Statistics REQ

Page 21: Main Bus Interfaces

Output Controls Interface:
- Demand from Local FIFO Masters
- Output Grant ACK from PCI FIFO
- End of vector from PCI FIFO Master

Page 22: RING Bus Interfaces

Output Data Interface:
- Output data bus from Local FIFOs
- Data Valid from Local Masters
- End of output Vector from Local Masters
- Statistics Data
- Statistics Valid

Page 23: Chip Level

- Local FIFO for inputs/outputs
- Internal clusters configuration
- Arbitration, priorities
- Statistics, Synchronization

Page 24: Single FPGA Top Diagram

[Diagram: Stratix II 180 FPGA (LBS 1-4) with DDR2 controls for Bank A and Bank B, an I/O - LBS Control Block, a Bus Control Block, the Load Balancing Switch (LBS) and 16 NIOS clusters, with the data flow marked.]

Page 25: Input System Interface

LBS Input Interface:
- 64 bit data bus from Input MultiFIFO
- Read request and ack. signals
- MultiFIFO status flags
- SW/HW input signals

Page 26: Output System Interface

LBS Output Interface:
- 64 bit data bus to Output MultiFIFO
- Write request and ack. signals
- MultiFIFO status flags
- SW/HW input signals

Page 27: Data Packet Format

Packet: Header, Data (1 to N 32-bit words), Tail

Header fields (as labeled on the slide): Unused; Nios Number; Data Length N; Vector ID/Command; Type; Version (4-bit); SW/HW Control (1-bit); Type (1-bit, Data/Command). Further widths shown: 8-bit, 32-bit, 16-bit.

Tail: Sync Data

Page 28: LBS Top Level View

[Diagram: inside the Stratix II FPGA, on the PCI side, a FIFO Input Port feeds the Input Reader and a FIFO Output Port is fed by the Output Writer. An input data bus and a muxed output data bus connect them to three Cluster Arbiters, each with its NIOS II System. The Main Controller unit exchanges control and status signals with these blocks; a Statistics Reporter is attached.]

Page 29: Organization of VPUs (Vector Processing Units)

- NIOS VPUs joined into clusters
- Constant number of clusters
- Parametric number of NIOS VPUs in a cluster
- Parametric control logic
- Variable configuration of NIOS
- Different priority for different clusters

Page 30: NIOS Input Interface

Hardware:
- 64-bit input data bus – from LBS
- 10-bit data slices counter – from LBS
- Write request signal – from LBS
- Chip select signal – from LBS
- NIOS ready signal – from NIOS
- Data ready signal – from LBS

Page 31: NIOS Output Interface

Hardware:
- 64-bit output data bus – from NIOS
- 7-bit data slices counter – from LBS
- Read request signal – from LBS
- Chip select signal – from LBS
- Output ready signal – from NIOS
- Output taken signal – from LBS

Page 32: Twin VPU System - Input / Output waveform

Page 33: LBS Units Description - Input Reader

- Reading data from input FIFO
- Writing data to selected cluster
- Providing header control bits for main controller
- Synchronization checks
- Vector length counter
- Input time stamp

[Diagram: LBS top-level view, as on Page 28.]

Page 34: LBS Units Description - Sync Flusher

[Diagram: LBS top-level view, as on Page 28.]

- Flush data on input error
- Look for Sync Tail
- Parametric number of recovery tries
- Failure signal to Error Reporter
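The recovery steps listed for the Sync Flusher can be sketched in software as follows. This is only an illustration, not the authors' hardware design: the sync word value `SYNC_TAIL`, the names `flush_to_sync` and `recover`, and the idea that each recovery try scans a bounded window of words are all assumptions, since the slides give none of these details.

```python
from typing import Iterator

SYNC_TAIL = 0xA5A5A5A5  # hypothetical sync word; the slides do not give its value


def flush_to_sync(words: Iterator[int], window: int) -> bool:
    """Discard up to `window` words; return True if the sync tail was among them."""
    for _, word in zip(range(window), words):
        if word == SYNC_TAIL:
            return True
    return False


def recover(words: Iterator[int], max_tries: int = 3, window: int = 1024) -> bool:
    """Try to resynchronize; False stands for the failure signal to the Error Reporter."""
    for _ in range(max_tries):  # "parametric number of recovery tries"
        if flush_to_sync(words, window):
            return True
    return False


stream = iter([0x11, 0x22, SYNC_TAIL, 0x33])
print(recover(stream), hex(next(stream)))  # True 0x33
```

After a successful recovery the stream is positioned on the first word after the tail, so the next packet header can be read normally.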

Page 35: Input Reader Diagram

Page 36: LBS Units Description - Input Controller FSM

Page 37: LBS Units Description - Output Writer

- Reading data from selected cluster
- Writing data to output FIFO
- Vector length counter
- Output time stamp

[Diagram: LBS top-level view, as on Page 28.]

Page 38: Output Writer Diagram

Page 39: LBS Units Description - Output Controller FSM

Page 40: LBS Units Description - Main Controller

- Enabling input and output units
- Selecting control source (S/W or H/W)
- Monitoring clusters’ load via status buses
- Selecting clusters for input/output operations
- Data validity indication

[Diagram: LBS top-level view, as on Page 28.]

Page 41: Main Controller - Status Decoders

Page 42: LBS Units Description - MC Status Algorithm

- Independent status decoders for input and output
- Static priority
- Dynamic load
- Parametric aging mechanism
- Round robin within the same priority group
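One way to combine the four selection criteria named on this slide is sketched below. It is a software illustration under stated assumptions, not the authors' decoder: the `Cluster` and `Arbiter` names, the convention that a lower number means higher priority, the use of "free VPUs" as the load indicator, and the rule that an aged cluster is promoted to the top group are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    priority: int   # static priority; lower value is served first (assumed convention)
    free_vpus: int  # dynamic load: VPUs currently able to accept a vector
    age: int = 0    # arbitration rounds spent waiting while ready


class Arbiter:
    def __init__(self, clusters, aging_limit=4):
        self.clusters = clusters
        self.aging_limit = aging_limit  # hypothetical aging parameter
        self.last = -1                  # index of the last grant, for round robin

    def effective_priority(self, c):
        # Aging: a cluster that waited too long is promoted to the top group.
        return 0 if c.age >= self.aging_limit else c.priority

    def select(self):
        # Dynamic load: only clusters with a free VPU take part.
        ready = [i for i, c in enumerate(self.clusters) if c.free_vpus > 0]
        if not ready:
            return None
        best = min(self.effective_priority(self.clusters[i]) for i in ready)
        group = [i for i in ready
                 if self.effective_priority(self.clusters[i]) == best]
        # Round robin: first group member after the last granted index.
        n = len(self.clusters)
        chosen = min(group, key=lambda i: (i - self.last - 1) % n)
        for i in ready:
            c = self.clusters[i]
            c.age = 0 if i == chosen else c.age + 1
        self.last = chosen
        return self.clusters[chosen]


clusters = [Cluster("A", 1, 2), Cluster("B", 1, 2), Cluster("C", 2, 2)]
arb = Arbiter(clusters, aging_limit=3)
print([arb.select().name for _ in range(4)])  # ['A', 'B', 'A', 'C']
```

In the usage example A and B alternate inside the higher-priority group until the lower-priority cluster C has waited three rounds and is served through aging. The sketch does not update `free_vpus` after a grant; a real controller would take it from the status buses.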

Page 43: LBS Units Description - MC Status Algorithm

[Diagram: decoding example with stages labeled Status input, Static Priority / Aging mapping, Dynamic port mapping, RR on Active ports, Next port.]

Page 44: Decoding Flow

Page 45: LBS Units Description - Statistics Reporter

- Monitoring system activity
- Error reporting for software
- Counting processed vectors
- Throughput = Vectors served / Time of service

[Diagram: LBS top-level view, as on Page 28.]
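The throughput definition on this slide is a plain ratio. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def throughput(vectors_served: int, time_of_service_s: float) -> float:
    """Vectors per second: vectors served divided by time of service."""
    if time_of_service_s <= 0:
        raise ValueError("time of service must be positive")
    return vectors_served / time_of_service_s


print(throughput(1000, 0.5))  # 2000.0
```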

Page 46: LBS Units Description - Clusters Load Reporter

- Monitoring clusters activity
- Per VPU active/free status
- Sending status by request from Classifier

[Diagram: LBS top-level view, as on Page 28.]

Page 47: LBS Units Description - Error Reporting

- Input Reader error
- Recovery Synchronize Failure
- Packets Drops
- Local Output Master’s Error
- LBS’s activity

[Diagram: LBS top-level view, as on Page 28.]

Page 48: LBS Units Description - Cluster Entity

- Cluster/VPUs parametric enabling
- Cluster/VPUs status reporters
- VPUs flow controllers
- Watchdogs
- NIOS Systems

[Diagram: LBS top-level view, as on Page 28.]

Page 49: LBS Units Description - Cluster Configs

- Define quantity of VPU ports
- Define type of VPUs in cluster
- Automatic creation of per-VPU control logic
- Parametric arbiter for input/output data

[Diagram: LBS top-level view, as on Page 28.]

Page 50: LBS Units Description - Per Nios Structure

Page 51: LBS Units Description - VPU Controller

- Input 4-phase REQ/ACK protocol with NIOS: Nios Ready, Data Ready
- Output 4-phase REQ/ACK protocol with NIOS: Output Ready, Output Taken
- Smart Status Reporter
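A generic 4-phase (return-to-zero) REQ/ACK handshake, the kind of protocol named on this slide, can be simulated as below. This is a sketch of the textbook protocol, not of the authors' FSMs: the function name is hypothetical, and treating the LBS as the requesting side and the NIOS as the acknowledging side on the input path is an assumption about how Data Ready and Nios Ready are used.

```python
def handshake(words):
    """Transfer words one at a time; return (received words, (req, ack) trace)."""
    req = ack = 0
    bus = None
    received, trace = [], []
    pending = list(words)
    while pending or req or ack:
        if not req and not ack and pending:
            bus = pending.pop(0)   # phase 1: sender drives data, raises REQ
            req = 1
        elif req and not ack:
            received.append(bus)   # phase 2: receiver latches data, raises ACK
            ack = 1
        elif req and ack:
            req = 0                # phase 3: sender sees ACK, drops REQ
        else:
            ack = 0                # phase 4: receiver sees REQ low, drops ACK
        trace.append((req, ack))
    return received, trace


data, trace = handshake([0xCAFE])
print(data == [0xCAFE], trace)  # True [(1, 0), (1, 1), (0, 1), (0, 0)]
```

Each word costs four signal transitions, and both lines return to zero before the next transfer starts, which is why the protocol tolerates sender and receiver running at different speeds.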

Page 52: LBS Units Description - VPU Controller

Page 53: VPU Input FSM

Page 54: VPU Output FSM

Page 55: LBS Units Description - General NIOS System

- Single processor with in/out buffers
- HW accelerated system
- Shared resources system with mutex
- Multi-processor system with a number of ports to the Cluster

Page 56: LBS Units Description - Rafael’s basic NIOS System

SOPC components:
- Nios II with custom algorithm
- Program memory
- Input Vector and Output Vector Buffers
- HW Accelerator
- Timer

Page 57: LBS Units Description - Dummy NIOS System

Stub components:
- Input Vector Buffer
- Output Vector Buffer
- Flow management
- Parametric delay for performance analysis

Page 58: Resources & Performance

Page 59: Resource Usage

| Module | Logic utilization | % | Memory (M4K) | % |
| Peripheral IPs (MegaFIFO, PLLs, etc.) | 3,100 | 2 | 16 | 2 |
| User System (All VPUs + LBS) | 42,000 | 30 | 675 | 88 |
| Single VPU | 6,775 | 4.7 | 112 | 15 |
| LBS Logic | 1,350 | 1 | 3 | 0.5 |
| Total usage of chip resources | 45,896 | 32 | 691 | 90 |
| Total available | 143,000 | 100 | 768 | 100 |

Resource usage data for a 6 VPU system.

VPU resource usage is based on basic VPUs and may be decreased by advanced configurations and policies.

Page 60: Performance of LBS

- Theoretical throughput: 100 MHz x 64 bit = 6.4 Gbit/s
- Arbitration and routing latency: 2-4 cycles on average
- 60% throughput for short vectors, up to 95% for long vectors
- PCI and slow algorithms = bottlenecks
- 1 Mbit/s – 400 Mbit/s real throughput
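The theoretical figure follows directly from the clock rate and bus width. The sketch below reproduces it and adds a simple efficiency model that is an assumption, not the authors' measurement: one bus word per cycle plus a fixed per-vector arbitration overhead. Function names are hypothetical.

```python
CLOCK_HZ = 100e6  # 100 MHz, from the slide
BUS_BITS = 64     # 64-bit data bus, from the slide


def peak_throughput_bps(clock_hz=CLOCK_HZ, bus_bits=BUS_BITS):
    """Peak bus throughput in bit/s: one bus word per clock cycle."""
    return clock_hz * bus_bits


def bus_efficiency(data_cycles, overhead_cycles):
    """Assumed model: useful cycles divided by useful plus overhead cycles."""
    return data_cycles / (data_cycles + overhead_cycles)


print(peak_throughput_bps() / 1e9)  # 6.4
for overhead in (2, 4):             # latency range quoted on the slide
    print(overhead, round(bus_efficiency(4, overhead), 2))
```

Under this model a vector that occupies the bus for 4 cycles reaches between 50% and 67% of peak for 2 to 4 cycles of overhead, which brackets the 60% quoted for short vectors; longer vectors amortize the same overhead and approach the peak.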

Page 61: Performance for short vectors

| System | Time of service [sec] | Throughput [Mbit/s] | Impr |
| SW (on Core2Duo E6600) | 0.1 | 3.2 | |
| 6 VPUs | 0.00209 | 122 | 38 |
| 2 Classes of 6 VPUs | 0.00134 | 191 | 60 |
| 3 Classes of 6 VPUs | 0.00086 | 297 | 93 |
| 4 Classes of 6 VPUs | 0.00064 | 400 | 125 |

Time and throughput for 1000 vectors of 4 chunks each.

VPU performance is based on basic VPUs and RR arbitration and may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.
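The "Impr" column of the short-vector table appears to be the hardware throughput divided by the software throughput; that reading is an inference from the numbers, not something the slide states. A short check using the table's own values (names are hypothetical):

```python
SW_MBIT_S = 3.2  # software baseline on the Core2Duo E6600, from the table

hw_mbit_s = {
    "6 VPUs": 122,
    "2 Classes of 6 VPUs": 191,
    "3 Classes of 6 VPUs": 297,
    "4 Classes of 6 VPUs": 400,
}


def improvement(hw, sw=SW_MBIT_S):
    """Speed-up of a hardware configuration over the software baseline."""
    return hw / sw


for name, rate in hw_mbit_s.items():
    print(f"{name}: x{round(improvement(rate))}")
```

Rounded to whole numbers, the ratios come out as 38, 60, 93 and 125, matching the table.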

Page 62: Performance for medium vectors

| System | Time of service [sec] | Throughput [Mbit/s] | Impr |
| SW (on Core2Duo E6600) | 2.9 | 2.3 | |
| 6 VPUs | 0.28 | 23.4 | 10 |
| 2 Classes of 6 VPUs | 0.15 | 43.5 | 18.5 |
| 3 Classes of 6 VPUs | 0.01 | 66 | 28.7 |
| 4 Classes of 6 VPUs | 0.074 | 88 | 38 |

Time and throughput for 1000 vectors of 200 chunks each.

VPU performance is based on basic VPUs and RR arbitration and may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.

Page 63: Performance for long vectors

| System | Time of service [sec] | Throughput [Mbit/s] | Impr |
| SW (on Core2Duo E6600) | 1.1 | 2.9 | |
| One VPU | 1.224 | 2.62 | 0.89 |
| 6 VPUs | 0.208 | 15.43 | 5.3 |
| 2 Classes of 6 VPUs | 0.11 | 29.1 | 10 |
| 3 Classes of 6 VPUs | 0.074 | 43.69 | 14.8 |
| 4 Classes of 6 VPUs | 0.061 | 52.46 | 18 |

Time and throughput for 100 vectors of 1000 chunks each.

VPU performance is based on basic VPUs and RR arbitration and may be increased for a given workload, after performance analysis, by defining advanced configurations and policies.

Page 64: System performance – missing TOAs

[Plot: processing time [sec] (axis 0.085 to 0.115) versus number of missing TOAs (axis 0 to 25).]

Page 65: System performance – noise levels

[Plot: processing time [sec] (axis 0.094 to 0.184) versus noise [%] (axis 0 to 60).]

Page 66: Tasks (part A)

- Study relevant tools and environments (GiDEL PROCWizard, API, Quartus, STP…) – Done
- Define interfaces with other groups – Done
- Define basic algorithm for h/w switching – Done
- Implement and debug the switch – Done
- Develop stubs for testing – Done
- Expand design for several NIOS’s – Done
- Integration with NIOS system – Done
- SW Test application for operating and integration with hardware design – Done

Page 67: Tasks (part B)

- Increase number of Nios’s in clusters – Done
- Improve algorithm for priority cluster selection – Done
- Expand statistic reports – Done
- Expand SW/HW communication – Done
- Add error correction/handling – Done
- Spread design to several FPGAs – Done
- Complete integration with relevant projects – Done

Page 68: Summary

- Flexible architecture
- LBS with various SW/HW control and statistics
- Up to 4 chip x 16 cluster x 32 VPU system
- Fully functional S/W – Board – LBS – NIOS interfaces
- Successful hardware and software integration
- Working design examples for other teams

Page 69: Conclusions

- Tree Switch concept is simple and efficient
- Three layers abstraction concept = minimize changes
- Buffers for every Class = independence
- SAE for every VPU = balancing and performance
- Single level of mastering = minimize resources
- 64-bit buses = maximize throughput
- Single data interface to SW = bottleneck for high-speed designs

Page 70: Conclusions (cont.)

- Main bottlenecks are the PCI bus and the VPU algorithm
- Throughput varies between 1 Mb/sec and 400 Mb/sec (vector dependent)
- The design complies with the requirements
- Further improvements in the algorithm will speed up the system and increase the number of VPUs