
Page 1: Load Sharing for  Cluster-Based Network Service

Load Sharing for Cluster-Based Network Service

Jiani Guo and Laxmi Bhuyan

Architecture Lab, Department of Computer Science and Engineering

University of California, Riverside

Page 2: Load Sharing for  Cluster-Based Network Service


Courtesy: “Cluster-Based Scalable Network Services”, Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer and Paul Gauthier.

Page 3: Load Sharing for  Cluster-Based Network Service

Video on Demand System

[Figure: a Multimedia Grid connecting a Media Server and a Camera, through a Transcoding Service, to clients over the Internet.]

- A large number of clients
- Heterogeneity in clients' inbound network bandwidth, CPU/MEM capacity, and display resolution
- Storing multiple copies of each stream on the server gives rise to server overload and scalability problems

Page 4: Load Sharing for  Cluster-Based Network Service

Cluster-Based Transcoding Service

[Figure: the same Multimedia Grid, with the Transcoding Service now realized as a cluster sitting between the Media Server/Camera and the clients.]

- Process the stream on the fly according to the client's requirements => make some money
- A wide range of needs in video rates, sizes, and bandwidths can be met by a real-time transcoding service - this needs parallel processing

Page 5: Load Sharing for  Cluster-Based Network Service

Existing Load Balancing Schemes

- There is a plethora of research in the field of load balancing, but most of it is simulation-only
- Random or Round-robin schemes are what get implemented in practice
- Adaptive load balancing is desirable, but the overhead of collecting statistics is very high - we found no real implementation
- How does one maintain QoS while doing load balancing? Example: to reduce out-of-order departures of multimedia units, the GOPs of a stream should be assigned to one processor, whereas good load balancing requires distributing the workload

Page 6: Load Sharing for  Cluster-Based Network Service

Example: Round Robin

[Diagram: a Receiver fills a Unit Buffer on the Manager; the Manager fetches a unit, finds an available Worker (Worker 1 .. Worker N), and sends the unit through a single Dispatcher.]

High communication protocol (UDP) overhead

Page 7: Load Sharing for  Cluster-Based Network Service

Round Robin - A Multithreaded Model to Reduce Communication Cost

[Diagram: the same structure, but with M Dispatcher threads (Dispatcher 1 .. Dispatcher M) between the Manager and the Workers; the Manager still fetches a unit, finds an available Worker, and sends the unit.]

Page 8: Load Sharing for  Cluster-Based Network Service

Load Sharing Schemes: Round Robin - First Fit

Method
- Search for an available Worker in round-robin order
- The first available Worker is dispatched a GOP
- How a Manager detects whether a Worker is available is implementation-dependent

Properties
- Load is naturally balanced among all the Workers
- Fast processing rate, because no extra load analyzer is needed to guide scheduling
- May incur severe delay jitter for each stream, because the GOPs of the same stream are most likely to be distributed to different Workers (see the sketch after this list)
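As a rough illustration, here is a minimal Python sketch of the first-fit search. The names are hypothetical, and availability is modeled as a vacancy in a bounded per-Worker dispatch queue, matching the test described on the next slide; the real detection mechanism is implementation-dependent.

```python
# Minimal first-fit sketch (hypothetical names). Availability is
# modeled as a vacancy in a bounded per-Worker dispatch queue.
from collections import deque

QUEUE_CAPACITY = 4  # assumed dispatch-queue depth, not from the paper

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.dispatch_queue = deque()

    def available(self):
        # "Is there a vacancy in the dispatch queue?"
        return len(self.dispatch_queue) < QUEUE_CAPACITY

def first_fit_dispatch(workers, gop, start):
    """Scan Workers round-robin from `start`; enqueue the GOP at the
    first available Worker and return the next start position."""
    n = len(workers)
    for i in range(n):
        w = workers[(start + i) % n]
        if w.available():
            w.dispatch_queue.append(gop)
            return (start + i + 1) % n
    raise RuntimeError("no Worker available: all dispatch queues full")
```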

Page 9: Load Sharing for  Cluster-Based Network Service

Round Robin - First Fit

[Diagram: on the Manager Node, a Receiver fills the GOP Queue; a Scheduler moves GOPs into per-Worker dispatch queues, one Dispatcher thread per Worker (Dispatcher 1 .. Dispatcher N serving Worker 1 .. Worker N).]

Is the Worker available? = Is there a vacancy in its dispatch queue? This depends on the power of the Worker!

Page 10: Load Sharing for  Cluster-Based Network Service

Stream-based Mapping

Method
- A media unit is mapped to a Worker according to the function f(C) = C mod N, where C is the stream number to which the unit belongs and N is the total number of Workers in the cluster
- All media units belonging to one stream are sent to the same Worker

Properties
- Preserves the order of computation among media units
- Simple algorithm
- Most efficient for some specific input patterns in a homogeneous cluster, namely when M is a multiple of N, where M is the total number of streams. What if M < N? (See the sketch below.)
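A two-line sketch (illustrative names) makes the mapping, and the M < N problem, concrete:

```python
# Stream-based mapping: every GOP of stream C goes to Worker C mod N.
def stream_based_mapping(stream_number: int, num_workers: int) -> int:
    return stream_number % num_workers

# With M = 2 streams and N = 4 Workers, only Workers 1 and 2 ever
# receive work; the other two sit idle (the M < N case).
assert {stream_based_mapping(c, 4) for c in (1, 2)} == {1, 2}
```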

Page 11: Load Sharing for  Cluster-Based Network Service

Adaptive Load Balancing - Least Load First

Feedback-based scheme - Least Load First
- An efficient load test mechanism is needed for the Manager to monitor load distribution in the cluster: Workers periodically report their load statistics to the Manager
- The Worker with the least load is chosen for the next job
- May incur substantial overhead to implement the feedback mechanism

Each Worker reports its load information to the Manager during each epoch ∆t:
- CPU utilization AUi(t)
- Maximal possible throughput Ai(t)
- Actual throughput Ai(t) - n, where n is the number of outstanding requests, i.e., the number of GOPs already dispatched to that Worker but not yet completed

The Manager chooses the least loaded Worker: the one with the maximal actual throughput. A sketch of this selection rule follows below.
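The selection rule, under assumed names (the reporting transport is elided):

```python
# Least Load First sketch: pick the Worker whose "actual throughput"
# A_i(t) minus its outstanding GOPs is largest.
from dataclasses import dataclass

@dataclass
class WorkerStats:
    wid: int
    max_throughput: float  # maximal possible throughput A_i(t)
    outstanding: int       # GOPs dispatched but not yet completed

def least_load_first(stats):
    """Return the id of the least loaded Worker."""
    return max(stats, key=lambda s: s.max_throughput - s.outstanding).wid

# Worker 2 has the most spare capacity (8 - 2 = 6), so it is chosen.
stats = [WorkerStats(1, 10.0, 8), WorkerStats(2, 8.0, 2), WorkerStats(3, 12.0, 11)]
assert least_load_first(stats) == 2
```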

Page 12: Load Sharing for  Cluster-Based Network Service

Adaptive Load Sharing: Unit-to-Computing-PC Mapping (done by the Dispatcher)

Robust Hashing Mapping
- The unit identifier (the stream number of the unit, in our experiment) and the Computing PC number together are used to assign a random value to each Computing PC. The unit is mapped to the Computing PC with the highest random value.
- If the Computing PCs have unequal capacity, the random value assigned to each Computing PC may be scaled by a weight, which guarantees that a Computing PC with higher capacity receives a proportionately higher portion of the load.
- Thus, the mapping is calculated based on three values: the stream number of the unit C (1, 2, ..., S), the Computing PC number J (1, 2, ..., N), and the weight vector (x1, x2, x3, ..., xN).
- This minimizes the probability of units belonging to the same stream being dispatched to different nodes, and the goal is achieved without keeping per-stream state. (A sketch appears below.)

Dynamic Weight Adaptation (done by the Manager)
- The workload on the Computing PCs (ρ1(t), ρ2(t), ..., ρN(t)) is collected periodically, and the weight vector (x1, x2, x3, ..., xN) is adapted so that the number of stream re-mappings is minimized while load balance is achieved.
- The adapted weight vector is fed to the Dispatchers.
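One plausible realization of the weighted robust-hash mapping; the hash choice and the multiplicative scaling are assumptions, not the paper's code:

```python
# Robust (highest-random-weight) mapping sketch. A deterministic
# pseudo-random value is derived from the stream number C and the
# Computing PC number J, then scaled by the PC's weight x_J.
import hashlib

def _rand(c: int, j: int) -> float:
    digest = hashlib.md5(f"{c}:{j}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64  # in [0, 1)

def robust_hash_mapping(c: int, weights) -> int:
    """Map stream c to the Computing PC (numbered 1..N) with the
    highest weighted random value; no per-stream state is kept."""
    scores = {j: x * _rand(c, j) for j, x in enumerate(weights, start=1)}
    return max(scores, key=scores.get)

# Every unit of stream 7 lands on the same PC, deterministically.
assert robust_hash_mapping(7, [1.0, 1.0, 1.0]) == robust_hash_mapping(7, [1.0, 1.0, 1.0])
```

Because each stream's score depends only on (C, J, x_J), changing one weight remaps only the streams whose winner changes, which is why the scheme avoids wholesale re-mapping when weights are adapted.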

Page 13: Load Sharing for  Cluster-Based Network Service

Adaptive Load Sharing

[Diagram: on the Routing PC, a Receiver fills the Unit Buffer; Dispatcher 1 .. Dispatcher M map units to Computing PC 1 .. Computing PC N. Each Computing PC reports its load ρi(t) to the Manager, which feeds the adapted weight vector (x1, x2, x3, ..., xN) back to the Dispatchers. Dispatch loop: fetch a unit; compute F(C) = J; if node J is available, send the unit to node J, otherwise retry.]

Page 14: Load Sharing for  Cluster-Based Network Service

Experimental Set-Up

[Diagram: a Media Server connects to the Manager over 100M Ethernet; the Manager connects to four Computing PCs over Gigabit Ethernet. Un-processed packets flow from the Media Server into the cluster; processed packets flow back out over 100M Ethernet.]

Page 15: Load Sharing for  Cluster-Based Network Service

Transcoding Service

What is transcoding?
- Transforming video/audio streams, e.g., changing the bit-rate, resizing video frames, adjusting the frame resolution, and so on.

How to transcode?

MPEG Stream -> MPEG Decoder -> Raw Stream -> Video/Audio Frame Manipulator -> Manipulated Stream -> MPEG Encoder -> MPEG Stream
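A toy Python sketch of the decode -> manipulate -> encode chain; the three stage functions are stand-ins for illustration, not a real MPEG codec:

```python
# Toy stand-ins for the three pipeline stages (not a real codec).
def mpeg_decode(gop: bytes) -> list:
    # pretend each 8-byte chunk of the GOP is one raw frame
    return [gop[i:i + 8] for i in range(0, len(gop), 8)]

def manipulate(frames: list) -> list:
    # stand-in for bit-rate/size adjustment: keep every other frame
    return frames[::2]

def mpeg_encode(frames: list) -> bytes:
    return b"".join(frames)

def transcode(gop: bytes) -> bytes:
    # the pipeline above, as a function composition
    return mpeg_encode(manipulate(mpeg_decode(gop)))
```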

Page 16: Load Sharing for  Cluster-Based Network Service

Transcoding Workload

- A media unit is a Group Of Pictures (GOP) of an MPEG stream
- A media unit can be transcoded independently by any Worker in the cluster; transcoding one media unit is considered an independent job
- No communication is required among jobs
- Each job consumes a similar amount of processing time
- Consecutive media units in a stream should preferably be processed in order

Page 17: Load Sharing for  Cluster-Based Network Service

Design Goals of the Load Sharing Schemes

Balance the transcoding workload among all Workers
- High system throughput
- Low overhead taken by the load balancing algorithm itself
- Good tradeoff between computation and communication

Provide good Quality of Service - NEW
- In-order departure of media units
- Even output time interval among successive media units of a media stream

Page 18: Load Sharing for  Cluster-Based Network Service


Computation Model of the Transcoding Cluster

Page 19: Load Sharing for  Cluster-Based Network Service

Manager Node

Receiver Thread
- Accepts incoming media units into the GOP Queue

Scheduler Thread
- Fetches GOPs from the GOP Queue and puts them into an appropriate dispatch queue according to the specific load sharing scheme

Dispatcher Thread (one per Worker)
- Each Dispatcher maintains a dispatch queue; once requested by the corresponding Worker, it dispatches one GOP to that Worker

Manager Thread - only for the Least Load First scheme
- Collects load statistics from the Workers during each epoch and feeds the load information to the Scheduler

Collector Thread
- Collects processed video units from the Workers and sends them out

A thread-and-queue skeleton of this node is sketched below.
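The skeleton below assumes hypothetical transport calls (recv_gop, send_to_worker) and a pluggable scheduling policy; it is a structural sketch, not the paper's implementation:

```python
# Manager-node skeleton: Receiver -> GOP Queue -> Scheduler ->
# per-Worker dispatch queues -> Dispatcher threads. Transport calls
# (recv_gop, send_to_worker) are hypothetical placeholders.
import queue
import threading

NUM_WORKERS = 4
gop_queue = queue.Queue()
dispatch_queues = [queue.Queue(maxsize=4) for _ in range(NUM_WORKERS)]

def receiver(recv_gop):
    while True:                       # accept incoming media units
        gop_queue.put(recv_gop())

def scheduler(choose_worker):
    while True:                       # apply the load sharing scheme
        gop = gop_queue.get()
        dispatch_queues[choose_worker(gop)].put(gop)

def dispatcher(worker_id, send_to_worker):
    while True:                       # one thread per Worker
        send_to_worker(worker_id, dispatch_queues[worker_id].get())

# e.g.: threading.Thread(target=scheduler, args=(policy,), daemon=True).start()
```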

Page 20: Load Sharing for  Cluster-Based Network Service

Worker Node

Receiver Thread
- Receives packets from the Manager Node and assembles them into a complete GOP
- Once a complete GOP is received, hands it to the Transcoder Thread and then requests another GOP from the Manager Node

Transcoder Thread
- Transcodes a GOP

Sender Thread
- Delivers the transcoded GOPs to the clients

Monitor Thread
- Collects load statistics on the Worker node and reports them to the Manager Node periodically

Page 21: Load Sharing for  Cluster-Based Network Service


Scalability of the System

5 media streams

Page 22: Load Sharing for  Cluster-Based Network Service

Scalability of the System: Throughput

- System throughput scales well with First Fit and Least Load First.
- The load test overhead in the Least Load First scheme barely affects system throughput, because it is small relative to the time taken to transcode one GOP.
- Stream-based Mapping cannot disperse media units of the same stream among different Workers, even when a Worker is free: a waste of resources, occasional imbalance in load distribution, and reduced throughput.

Page 23: Load Sharing for  Cluster-Based Network Service


Out-of-Order Rate per Stream

Page 24: Load Sharing for  Cluster-Based Network Service

Out-of-Order Rate per Stream

Out-of-order departure of media units occurs when consecutive GOPs of a stream are transcoded on different Workers:
- The workload on different Workers differs
- Different media units consume different amounts of computation time

Stream-based Mapping eliminates out-of-order departure of media units.

First Fit has the largest out-of-order (OFO) rate.

Least Load First improves on First Fit by 50%.

Page 25: Load Sharing for  Cluster-Based Network Service


Output Time Interval (OTI) per Stream

Page 26: Load Sharing for  Cluster-Based Network Service

Output Time Interval (OTI) per Stream

Experiment setting: 4 homogeneous Workers, 5 media streams.

- First Fit achieves the best performance.
- Least Load First approaches First Fit.
- Stream-based Mapping shows a longer delay because of the limitation that one stream can only be processed by one Worker.

Page 27: Load Sharing for  Cluster-Based Network Service

Load Sharing Overhead with LLF

Load Test Overhead: the average time consumed by the Manager node to poll all Workers and collect their load statistics.

Load Remapping Overhead: the time used to set the current loads for each Worker.

Cluster Size                       1     2     3     4
Load Test Overhead (msecs)         0.87  1.6   2.3   3.0

Cluster Size                             2     3     4
Load Remapping Overhead (usecs)          4.2   4.5   4.8

Page 28: Load Sharing for  Cluster-Based Network Service

Load Sharing Overhead with LLF

- The load test overhead increases roughly in proportion to the cluster size.
- The load re-mapping overhead is much smaller than the load test overhead, almost negligible: the computation involved in load re-mapping costs far less than the network communication involved in the load test.

Page 29: Load Sharing for  Cluster-Based Network Service

Evolution of Average Workload on Each Node

[Plot: load (%) on node1, node2, and node3 over time (sec); the three curves stay roughly between 0.6 and 1.0 over the 10-70 second window.]

Page 30: Load Sharing for  Cluster-Based Network Service

Load Sharing Schemes: How to Take QoS into Consideration?

[Diagram: on the Schedule PC, a Receiver fills the Unit Buffer; the Scheduler fetches a unit, finds an available Computing PC among Transcoding PC 1 .. Transcoding PC N, and sends the unit.]

Page 31: Load Sharing for  Cluster-Based Network Service

Differentiated Service (Fair Scheduling)

A system is said to be capable of affording differentiated service among service classes if:
- The system permits its resources to be proportioned among the service classes
- Given sufficient request load, a service class receives at least as much of the resources as were assigned to it, irrespective of the load on other service classes
- Resources not used by some service class may be distributed among the other service classes

Page 32: Load Sharing for  Cluster-Based Network Service


Framework of Fair Scheduling

Page 33: Load Sharing for  Cluster-Based Network Service

Fair Scheduling

Fairly distribute resources among streams:
- Streams make reservations
- Received service is proportional to the reservations

Unit Scheduler - Weighted Round Robin (WRR)
- Provides a differentiated service rate to multiple streams
- Weights in each round-robin cycle are dynamically adapted to achieve the best performance
- The weight of stream i, Wi(t), is the number of GOPs scheduled for stream i during one round-robin cycle (see the sketch after this list)
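A minimal WRR sketch under assumed names; the dynamic weight adaptation itself is elided:

```python
# Weighted Round Robin sketch: in each cycle, stream i is granted
# up to W_i(t) GOPs, so received service tracks its reservation.
def wrr_cycle(stream_queues, weights):
    """One round-robin cycle over the streams; mutates the queues."""
    scheduled = []
    for stream_id, w in weights.items():
        q = stream_queues[stream_id]
        scheduled.extend(q[:w])          # take up to W_i GOPs
        stream_queues[stream_id] = q[w:]
    return scheduled

# Stream 1 reserved twice the service of stream 2:
queues = {1: ["g1a", "g1b", "g1c"], 2: ["g2a", "g2b"]}
print(wrr_cycle(queues, {1: 2, 2: 1}))   # ['g1a', 'g1b', 'g2a']
```

Unused capacity naturally flows to the other classes: if a stream's queue holds fewer than Wi GOPs, the cycle simply finishes sooner and the next cycle begins, which matches the differentiated-service property on the previous slide.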