
Application Behavior-aware Flow Control in Network-on-Chip

Advisor: Chung-Ta King
Student: Huan-Yu Liu

Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan 30013, R.O.C.

July, 2010

Abstract

Multicore may be the only viable answer to the performance and power concerns of future chip processor architectures. As the number of cores on a chip keeps increasing, traditional bus-based architectures can no longer offer the required on-chip communication bandwidth, so the Network-on-Chip (NoC) has become the main paradigm for on-chip interconnection. NoCs not only offer significant bandwidth advantages but also provide outstanding flexibility. However, the performance of an NoC can degrade significantly if the network flow is not controlled properly. Most previous solutions try to detect network congestion by monitoring the hardware status of the network switches or links. A change of hardware status at a local end may indicate possible congestion in the network, and packet injection into the network is then throttled in reaction. The problem with these solutions is that congestion detection is based only on local status, without global information. The actual congestion may occur somewhere else and can only be detected through backpressure, which may be too passive and too slow for taking corrective measures in time.

This work takes a proactive approach to congestion detection. The idea is to predict the changes in the global, end-to-end network traffic patterns of the running application and take proactive flow control actions to avoid possible congestion. Traffic prediction is based on our recent paper [1], which uses a table-driven predictor for predicting application communication patterns. In this thesis, we discuss how to use the prediction results for effective scheduling of packet injection so as to avoid network congestion and improve throughput. The proposed scheme is evaluated by simulation using a SPLASH-2 benchmark as well as synthetic traffic. The results show significant performance improvement with negligible execution overhead.

Chinese Abstract

When considering the performance and power issues of future chip processor architectures, multicore may be the only solution. As the number of cores on a chip keeps increasing, traditional bus-based architectures can no longer satisfy the communication bandwidth required on the chip, and the Network-on-Chip (NoC) has become the mainstream for on-chip interconnection. NoCs not only offer the advantage of considerable bandwidth but also exhibit outstanding flexibility. However, if the network flow is not controlled properly, the performance of the on-chip network degrades greatly. Most previous solutions try to detect network congestion by monitoring the hardware status of the switches and links in the network. A change in the hardware status at a local end point can indicate congestion that may occur in the network, and packet injection is then controlled according to that congestion status. These detection methods look only at local hardware status and do not consider the situation of the whole network; in reality, congestion does not only occur locally but may arise elsewhere in the network. Moreover, detecting congestion through hardware status is a backpressure mechanism, which is too passive and too slow to react to the actual congestion in time.

This thesis adopts a more proactive approach to detecting network congestion. The idea is to predict the traffic pattern of the running application from a global, end-to-end point of view, and to perform flow control based on this prediction so as to avoid congestion. The traffic prediction builds on a recent paper, which records the application's communication patterns in tables in order to make predictions. In this thesis, we discuss how to use the prediction results to schedule packet injection effectively, so that network congestion can be avoided and the overall throughput improved. The proposed scheme is evaluated by simulation using SPLASH-2, and synthetic network traffic is also used in the experiments. The experimental results show that our approach greatly improves the overall performance, and the total execution time is also slightly reduced.

Acknowledgements

If you asked me to compare how fulfilling my two years of master's study were against my four years of undergraduate life, I would choose the former without hesitation. Undergraduate life was mostly about attending classes, studying, and taking exams. Graduate school is different: besides coursework and exams, I was busy working on projects and searching for a thesis topic. Finding a topic was a real saga; I changed again and again and could never find a suitable one. When I finally settled on a topic, I lived under the constant fear and pressure of not finishing on time. In the end, I completed the work and produced a result, and although the research process was full of frustrations, I also learned a great deal, both academically and personally.

The first person I want to thank is my advisor, Professor Chung-Ta King, who gave my thesis many valuable suggestions and pointed out my blind spots so that my research could proceed smoothly. I am deeply grateful to him.

I also want to thank everyone in the Multi-core group, especially 有希 and 布拉胖, who gave my thesis a lot of practical help; 有希 in particular patiently discussed my work with me time and again and was a major force behind the completion of this thesis. Thanks also to kaven, whose ability I can only admire from afar; I often discussed my thesis with you and learned a lot, and your banter brought us plenty of laughter. The 雙翔 juniors are a comedic duo who convinced me that our Multi-core group really is made up of remarkable characters.

Thanks to every Ph.D. senior, classmate, and junior in PADS. Although our research areas differ, PADS is always full of joy because of you, which kept my spirits up even under the pressure of graduating.

Finally, I thank my family and my close friends; you are my greatest motivation.

Contents

1 Introduction
2 Motivating Example
3 Related Work
4 Problem Formulation
  4.1 Application-Driven Predictor
5 Traffic Control Algorithm
  5.1 Traffic Control Algorithm and Implementation Overhead
  5.2 Data Aggregation
  5.3 Area Occupancy
6 Experimental Results
  6.1 Simulation Setup
  6.2 Real Application Traffic
  6.3 Synthetic Traffic
7 Conclusion and Future Work

List of Tables

6.1 Simulation Configuration
6.2 Our proposed flow control algorithm leads to a large reduction in latency with only a slight execution time overhead.
6.3 For synthetic traffic, our proposed flow control algorithm leads to a large reduction in both the average and the maximum latency and a slight reduction in the execution time.

List of Figures

2.1 The tile arrangement and interconnection topology used for the experiment on the TILE64 platform
2.2 The traffic of router 4: the overall input/output traffic, the decomposed traffic of the pairs (5,4), (6,4), and (7,4), and the output traffic from router 4 to router 5
4.1 The structure of a router
4.2 An example of an L1-table. The columns G4:G0 record the quantized transmitted data sizes of the last 5 time intervals.
4.3 An example of an L2-table, indexed by the transmission history pattern G4:G0. The data size level Gp is the level predicted for the next time interval.
4.4 The table which records the delayed transmissions
5.1 The diagram of the flow control algorithm
5.2 The diagram of flow control
6.1 Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically.
6.2 The maximum workload of the links in the network without (a) and with (b) the proposed flow control.

Chapter 1

Introduction

The number of transistors on a chip has increased exponentially over the decades according to Moore's Law. At the same time, applications have also grown in complexity and therefore demand enormous amounts of computation. These factors are further coupled with the growing need for power saving as the clock frequency of a core increases. The prevailing practice is therefore to adopt multicore architectures and parallelize applications. However, communication overhead becomes a critical bottleneck if we cannot offer substantial bandwidth among the cores. Traditional bus-based architectures suffer from increased packet latencies as the number of on-chip cores increases and are incapable of providing performance guarantees, especially for real-time applications. As a result, the Network-on-Chip (NoC) has become the de facto solution to this critical problem.

NoCs not only offer significant bandwidth but also provide outstanding flexibility and scalability. Multi- and many-core processors that adopt an NoC as their communication fabric are already on the market. For example, Tilera's TILE64 [2], introduced in 2007, uses a 2-D mesh-based network to interconnect 64 tiles and 4 memory controllers. Indeed, NoCs are becoming the main communication and design fabric for chip-level multiprocessors.

Since the cores are connected by a network, flow control and congestion control in NoCs are important issues. If a core transmits too many packets to another core, the intermediate routers have to buffer many packet flits, congesting the network. Without an effective flow control mechanism, the performance of an NoC may degrade sharply due to congestion. According to [3], the accepted traffic increases linearly with the applied load until a saturation point is reached; after the saturation point, the accepted traffic decreases considerably.

Many solutions have been proposed for congestion in off-chip networks [4–6]. However, most of them are not suitable for on-chip networks. In off-chip environments, dropping packets is commonly used as a means of flow control when congestion happens, which requires the environment to provide an acknowledgment mechanism. On-chip networks, on the other hand, possess reliable on-chip wires and more effective link-level flow control, which make them almost lossless. As a result, there is no need to implement complicated protocols, such as acknowledgments, solely for flow control. This difference gives us the chance to come up with a new solution.

To the best of our knowledge, there are very few research works discussing the congestion control problem in NoCs. In [7], the switches exchange their load information with neighboring switches to avoid hot spots through which most packets would pass. In [8, 9], a predictive closed-loop flow control mechanism is proposed based on a router model, which is used to predict how many flits the router can accept in the next k time steps; however, it ignores the flits injected by neighboring routers during the prediction period. In [10, 11], a centralized, end-to-end flow control mechanism is proposed; however, it needs a special network, called the control NoC, to transfer OS-control messages, and it relies only on locally blocked messages to decide when a processing element is able to send messages into the network.

Most of the works mentioned above detect network congestion by monitoring hardware status, such as buffer fillings, link utilization, and the number of blocked messages. However, these statuses are bounded by the hardware itself. For example, the size and the number of buffers are limited, so without adding new hardware the detection may be very inaccurate; in particular, if a bursty workload exceeds the limits of the hardware, the congestion may not be detected immediately. In addition, congestion detection based on hardware status is a reactive technique: it relies on backpressure to detect network congestion, so the traffic sources cannot throttle their injection rates before the network becomes severely congested. Furthermore, previous work on NoC flow control does not take global information into consideration when making flow control decisions. Even if a certain core determines that the network is out of congestion and decides to inject packets, some links or buffers near other cores might still be congested, causing even more severe congestion.

In this thesis, we propose a proactive congestion and flow control mechanism. The core idea is to predict the future global traffic in the NoC according to the data transmission behaviors of the running applications; based on the prediction, we can control network injection before congestion occurs. Notice that most applications show repetitive communication patterns because they tend to execute similar code, such as a loop, across time intervals. These patterns may reflect the network state more accurately, since applications are the sources of the traffic in the network. Once the application patterns can be predicted accurately, the future traffic of every link can be estimated from this information, and the injection rate of each node can be controlled before the network goes into congestion. However, predicting the traffic in a network with high accuracy is a challenge. In this thesis, the data transmission behavior of the running application is tracked and then used as the clue for predicting the future traffic with a specialized table-driven predictor. This technique is inspired by branch predictors and works well for the end-to-end traffic of the network [1].

The main contributions of this thesis are as follows. First, we predict congestion according to the data transmission behaviors of applications rather than the hardware status, since the data transmissions of the application are the direct source of congestion in the NoC. Second, we modify the table-driven predictor proposed in [1] so that it not only captures and predicts the data transmission behaviors of the application at run time, but also makes the injection rate control decisions. Third, the implementation details of this traffic control algorithm are presented: by taking advantage of the many-core architecture, we can dedicate a core to making packet injection decisions so as to achieve globally coordinated performance.

This thesis is organized as follows. In Chapter 2, a motivating example is given to show the repetitive data transmission behavior of applications. In Chapter 3, related work is discussed. Next, we give a formal definition of the flow control problem in Chapter 4. In Chapter 5, we present the details of the traffic control algorithm. Evaluations are shown in Chapter 6. Finally, conclusions are given in Chapter 7.

Chapter 2

Motivating Example

In this chapter, we show that the data transmission behavior of parallel programs exhibits repetitive patterns, taking the LU decomposition kernel of the SPLASH-2 benchmark as an example. The LU decomposition kernel is ported to the TILE64 platform and run on a 4 × 4 tile array, as Figure 2.1 shows. The detailed experimental setup is described in Chapter 6. We used 16 tiles to port the application, and the routing algorithm is X-Y dimension-ordered routing. In the following discussion, we use the form (source, destination) to describe the transmission pairs.

Figure 2.2 shows the transmission trace of router 4. In the first diagram, the traffic seen at the East port of router 4 is mixed. This mixed traffic is somewhat messy and hard to predict. In previous works, traffic prediction is made mainly by checking the hardware status, such as the fullness of buffers, the utilization of links, and so on.

Figure 2.1: The tile arrangement and interconnection topology used for the experiment on the TILE64 platform: a 4 × 4 array of tiles numbered 0 to 15, with tiles 0, 4, 8, 12 in the first row, 1, 5, 9, 13 in the second, 2, 6, 10, 14 in the third, and 3, 7, 11, 15 in the fourth.

The hardware status is affected by this mixed traffic, as the first diagram shows. Such irregular traffic makes the hardware status unsuitable for predicting the network workload.

However, when we extract the traffic of the pairs (5,4), (6,4), and (7,4), as the second to fourth diagrams show, and the output traffic of (4,5) in the last diagram, the traces are much more regular and predictable. The separated transmission traces are recorded from the viewpoint of end-to-end data transmission, which is issued by the running application. The end-to-end data transmission exhibits repetitive patterns since the application executes similar operations across time intervals.

By exploiting this repetitive characteristic of application execution, we can predict the end-to-end data transmissions accurately by recording their history. The workload prediction for a given link in the network can then be derived by summing all the predicted end-to-end data transmissions that pass through this link. Since we can predict the NoC traffic of the next time interval, we can regulate the traffic sources ahead of packet injection, and congestion avoidance can thus be realized.
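As a concrete illustration of this per-link summation, the following minimal Python sketch derives link workload predictions from per-pair predictions under dimension-ordered (X-Y) routing on the 4 × 4 mesh of Figure 2.1. The tile-to-coordinate mapping, the function names, and the numbers in the example are assumptions made only for illustration; they are not part of our implementation.

MESH = 4  # 4 x 4 array of Figure 2.1; tile id = column * MESH + row (assumed mapping)

def xy_path(src, dst):
    # Return the directed links (a, b) visited by dimension-ordered routing.
    sc, sr = src // MESH, src % MESH      # column and row of the source tile
    dc, dr = dst // MESH, dst % MESH      # column and row of the destination tile
    links, cur = [], src
    while sc != dc:                       # route along the first dimension
        sc += 1 if dc > sc else -1
        nxt = sc * MESH + sr
        links.append((cur, nxt))
        cur = nxt
    while sr != dr:                       # then along the second dimension
        sr += 1 if dr > sr else -1
        nxt = sc * MESH + sr
        links.append((cur, nxt))
        cur = nxt
    return links

def predicted_link_workload(pair_predictions):
    # pair_predictions: {(src, dst): predicted data size for the next interval}
    # returns:          {(a, b): predicted workload on the link from a to b}
    workload = {}
    for (src, dst), size in pair_predictions.items():
        for link in xy_path(src, dst):
            workload[link] = workload.get(link, 0) + size
    return workload

# Example with arbitrary predicted sizes for the pairs of Figure 2.2;
# the result is {(5, 4): 8000, (6, 5): 4000, (7, 6): 1500}.
print(predicted_link_workload({(5, 4): 4000, (6, 4): 2500, (7, 4): 1500}))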

Figure 2.2: The traffic of router 4 is tracked. The first diagram shows all the traffic input/output at router 4 (the curves labeled "East output" and "East input"); the second to fourth diagrams show the decomposed traffic of the pairs labeled "5 to 4", "6 to 4", and "7 to 4" (the traffic relayed by router 4 is omitted); the last diagram shows the output traffic from router 4 to router 5 (labeled "4 to 5").

Chapter 3

Related Work

In [7], the information of a switch is sent to other switches for deciding routing paths that avoid congestion. The control information is exchanged locally and cannot reflect the status of the whole network. In [8, 9], the authors predict network congestion based on their proposed traffic source and router model: using this model, each router predicts the availability of its buffers ahead of time, i.e., how many flits the router can currently accept, and the traffic source cannot inject packets until the availability is greater than zero. They predict traffic from the switch perspective, whereas our predictions are made from the perspective of the applications.

In [12–15], the authors consider a congestion control scenario that models flow control as a utility maximization problem, and they propose iterative algorithms as solutions to the maximization problem.

The authors of [10] make use of the operating system (OS) and let the system software control the resource usage. In [11], the authors detail an NoC communication management scheme based on a centralized, end-to-end flow control mechanism that monitors the hardware status. All of the above works need a dedicated control NoC to transfer OS-control messages and a data NoC responsible for delivering data packets; the OS refers to the blocked messages of the local processing element to limit when the element is able to send messages. In [16], almost the same network architecture is assumed, except that some extra hardware is added to support a distributed HW/SW congestion control technique.

Model Predictive Control (MPC) is used for on-chip congestion control in [17]. In that work, the link utilization of a router is used as the indicator for measuring congestion. In contrast, our work makes predictions at the application layer rather than at the link layer in order to capture the transmission behaviors of the running applications; we argue that these behaviors are the main cause of network congestion.

Chapter 4

Problem Formulation

We already know that congestion can degrade network performance considerably, so congestion in the network should be avoided as much as possible. In [18], the queueing delay is used as one metric for congestion detection. In [17], the authors use link utilization as the congestion measure. Since there is no universally accepted definition of network congestion [19], we take link utilization as the congestion measure in this thesis. The utilization of a link e_i at the t-th time interval is defined as

$$Util_i(t) = \frac{D_i(t)}{T \times W}, \qquad 0 \le Util_i(t) \le 1,$$

where D_i(t) denotes the total data size transmitted over e_i during the t-th time interval, T is the length of a time interval in seconds, and W is the maximum bandwidth of a communication link. Thus T × W denotes the maximum possible data size that can be transmitted in one time interval.

We assume that if the utilization of a given link exceeds a properly selected threshold Th, this link is congested. Experimental results in [17] indicate that 80% link utilization still yields reasonable latencies before the congestion limit is reached. However, the chosen threshold value should take hardware configurations such as the buffer size and the link bandwidth into consideration.
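As an illustrative numeric instance of this criterion (the interval capacity and the transmitted data size below are assumed values, not measurements):

$$Util_i(t) = \frac{D_i(t)}{T \times W} = \frac{6800\ \text{bytes}}{8000\ \text{bytes}} = 0.85 > Th = 0.8,$$

so link e_i would be considered congested in interval t, and the control system would try to delay part of the traffic routed through it.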

We hope to prevent the network from becoming congested before it happens. This is achieved by predicting the possible traffic at the t-th time interval, so that several traffic sources are prevented from injecting packets concurrently. By scheduling the packet injection effectively, we can avoid network congestion and thereby improve the average packet latency. Latency is a commonly used performance metric and can be interpreted in different ways [3]; we define latency here as the time elapsed from when the message header is injected into the network at the source node until the tail of the packet is received at the destination node. Assume that λ is the average packet latency and t_exec is the total execution time without any flow control, while λ′ is the average packet latency and t′_exec is the total execution time with our proposed flow control. Our goal is to maximize λ − λ′ and t_exec − t′_exec. However, the execution time is affected by the communication dependencies between traffic sources [20]; this would require a further discussion of program dependencies, which is beyond the scope of this thesis.

Figure 4.1: The structure of a router: five inputs and five outputs connected by a crossbar. One input/output pair connects to/from the local processor, and the other four connect to/from the North, West, South, and East neighboring routers.

Dest.   LRU   Data size   Transmission history (G4 G3 G2 G1 G0)
5       0     256         5 3 1 2 4
8       2     128         3 3 0 3 3
10      1     512         2 2 2 2 2
13      3     64          5 4 3 5 4

Figure 4.2: An example of an L1-table. The columns G4:G0 record the quantized transmitted data sizes of the last 5 time intervals.

4.1 Application-Driven Predictor

In this subsection, we show how to predict the traffic using a table-driven network traffic predictor, and how to make traffic control decisions with an extra table that records the delayed transmissions. The original prediction method was proposed in [1]; however, that work only discusses how to monitor and predict the traffic without interfering with it. In this thesis, the future transmissions are controlled by our extended design.

Transmission history (G4 G3 G2 G1 G0)   LRU   Gp
5 3 1 2 4                               31    2
4 4 0 4 4                               13    4
5 4 2 5 3                               5     0
3 1 2 6 3                               12    2

Figure 4.3: An example of an L2-table, which is indexed by the transmission history pattern G4:G0 recorded in the L1-table. The corresponding data size level Gp is the level predicted to be transmitted in the next time interval.

Src.   Dest.   Data size   Priority
9      10      256         3
4      3       64          2
3      12      32          0
5      6       16          0

Figure 4.4: The table which records the delayed transmissions

In order to simplify the following discussion, we assume a 2-D mesh network of size N × N as the underlying topology. Note that our approach is independent of the topology and the size of the network, so it can easily be extended to other topologies and arbitrary network sizes. Each tile consists of a processor core, a memory module, and a router. We assume that the router has 5 input ports, 5 output ports, and a 5 × 5 crossbar; its structure is shown in Figure 4.1. The crossbar connects five directions: east, north, west, south, and the local processor. Each connection consists of two uni-directional communication links, for sending and receiving data, respectively. A deterministic routing algorithm is assumed, so the path between a source and a destination is determined in advance; this is the most common type of routing algorithm in current NoC implementations.

A table-driven predictor is employed to record the traffic history, and the history is then used to predict the data size and the destination of the outgoing traffic from each router in the next time interval. Each router maintains two hierarchical tables for tracking and predicting its data transmissions. The first-level table (L1-table), shown in Figure 4.2, tracks all output data transmissions. Each router uses only four entries to record transmission destinations, since a core typically communicates with only a subset of all the cores [1]; destination entries are replaced with an LRU policy to keep the table small. To map the recorded patterns to a guess about the following transmission, a second-level table (L2-table) is required. At the beginning of the t-th time interval, the transmission history recorded in the L1-table is used to index the L2-table and obtain the predicted level of the transmission data size for the t-th interval. During the t-th interval, whenever an output transmission is issued by the processor core, its destination and data size are recorded in the L1-table. The two tables are updated at the end of each predefined time interval. After the prediction is checked, the value of the data size counter in the L1-table is quantized and shifted into G0; the columns G0 to Gn thus record the quantized transmitted data sizes of the last n + 1 time intervals. The updated transmission history in the L1-table is then used to index the L2-table and retrieve the data size level predicted for the next time interval. If the transmission history cannot be found in the L2-table, the system either creates a new entry or replaces an existing entry by LRU, and uses the last value (G0) as the predicted data size level. The data size levels recorded in the L1-table are also used to check the accuracy of the prediction made in the previous interval; if that prediction was wrong, the value of Gp in the L2-table for the corresponding transmission history pattern is corrected to the level recorded in the L1-table.
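To make the table operations concrete, the following is a minimal Python sketch of the predictor update for a single destination entry. The history length and the number of entries follow Figures 4.2 and 4.3, but the class name, the quantization buckets, and the LRU realization are illustrative assumptions rather than the exact design of [1]; in particular, the sketch keeps a separate pattern table per destination, whereas in our design each router shares one L2-table among its L1-table entries.

from collections import OrderedDict

HISTORY_LEN = 5      # G4 .. G0, as in Figures 4.2 and 4.3
L2_CAPACITY = 4      # number of L2-table entries kept (assumed)

def quantize(size):
    # Map a raw data size (bytes) to a small level; the bucket widths are assumed.
    level = 0
    while size > 32 and level < 7:
        size //= 2
        level += 1
    return level

class DestinationPredictor:
    # Tracks one destination entry: a G4..G0 history plus an L2 pattern table.
    def __init__(self):
        self.history = [0] * HISTORY_LEN      # G4 .. G0 (G0 is the last element)
        self.l2 = OrderedDict()               # history pattern -> predicted level Gp
        self.last_prediction = None

    def end_of_interval(self, transmitted_bytes):
        g0 = quantize(transmitted_bytes)
        old_pattern = tuple(self.history)
        # Correct the L2 entry if the prediction made in the last interval was wrong.
        if (self.last_prediction is not None and self.last_prediction != g0
                and old_pattern in self.l2):
            self.l2[old_pattern] = g0
        # Shift the newly observed level into the history (it becomes the new G0).
        self.history = self.history[1:] + [g0]
        pattern = tuple(self.history)
        if pattern not in self.l2:
            if len(self.l2) >= L2_CAPACITY:
                self.l2.popitem(last=False)   # LRU-style replacement
            self.l2[pattern] = g0             # fall back to the last value (G0)
        else:
            self.l2.move_to_end(pattern)      # mark the entry as recently used
        self.last_prediction = self.l2[pattern]
        return self.last_prediction           # Gp: the level expected next interval

The returned level Gp is what the control system aggregates into per-link workload estimates in Chapter 5.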

Besides the traffic predictor, we maintain another table to record the delayed transmissions, as shown in Figure 4.4. When the traffic control algorithm decides to delay a transmission, we record its source, destination, and traffic size. To avoid starvation, the table also has a priority column: each time a transmission is delayed for another interval, its priority value is increased.

Chapter 5

Traffic Control Algorithm

In this chapter, we present a heuristic algorithm for NoC traffic management and then discuss possible ways to aggregate the prediction data.

5.1 Traffic Control Algorithm and Implementation Overhead

The algorithm detailed in Algorithm 1 runs on a central control system, and the algorithm detailed in Algorithm 2 runs on each node. The control system maintains two tables: one records the transmissions that have been delayed and the other records the transmissions that are predicted. Given these transmissions, the control system decides which should be delayed and which should be injected. The control message inject, sent from the control system to each node, determines whether source node i may inject packets destined for node j in the next time interval. Note that Algorithm 1 is executed at the beginning of each time interval, and Algorithm 2 is executed during the interval.

Because this flow control operates at the end-to-end layer, we use inject to indicate whether source i can send packets to destination j. Figure 5.1 is a simple flow chart of the flow control algorithm.

At the beginning, we assume that every source may send traffic to every destination (line 3). The algorithm then decides which transmissions in the delay table may be injected (lines 5-22). Each transmission has its own priority to avoid starvation (line 6); the transmission with the highest priority is the one that has been delayed the longest. The workload of a link (line 10) includes the workload that has not finished processing from before as well as the workload that may be injected in the next time interval. If the workload of any link on the path exceeds the threshold value, the control signal is set to false (line 11); the threshold value depends on the architecture. After deciding which transmissions in the delay table may be injected, the remaining transmissions update their priorities (line 23). The control system then collects the transmissions that the predictor expects to be injected in the next time interval and decides whether their control signals should be true or false (lines 24-39).

Algorithm 2 is executed in each source node during a time interval. Every source node receives the control message from the control system and acts on it (line 1). When there is a transmission from source i to destination j and the control message value is true, the source node is allowed to inject the traffic onto the network; otherwise, the source node does not inject any traffic and adds the transmission to the centralized delay table.

Figure 5.1: The diagram of the flow control algorithm

It is worth mentioning that the algorithms presented here are just one example of flow control given that the NoC traffic can be predicted; other algorithms could be used to solve the flow control problem.
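To make the decision procedure concrete, the following Python sketch condenses the core admission check of Algorithm 1. It is a simplified illustration under stated assumptions, not our actual implementation: Transmission, try_admit, and the threshold are assumed names, the routing paths and per-link workloads are taken as given (e.g., aggregated as in the sketch of Chapter 2), and bookkeeping such as removing admitted entries from the delay table is only hinted at in the comments.

from dataclasses import dataclass

@dataclass
class Transmission:
    src: int
    dst: int
    size: int
    priority: int = 0      # increased for every interval the transmission is delayed

def try_admit(t, path, workload, threshold):
    # Admit t only if no link on its routing path is already over the threshold.
    if any(workload.get(link, 0) > threshold for link in path):
        return False
    for link in path:                        # account for the admitted traffic
        workload[link] = workload.get(link, 0) + t.size
    return True

def schedule_interval(delayed, predicted, paths, workload, threshold):
    inject = {}
    # Lines 5-23 of Algorithm 1: delayed transmissions first, highest priority first.
    for t in sorted(delayed, key=lambda d: d.priority, reverse=True):
        ok = try_admit(t, paths[(t.src, t.dst)], workload, threshold)
        inject[(t.src, t.dst)] = ok
        if not ok:
            t.priority += 1                  # starvation avoidance (Figure 4.4)
    # Lines 24-39: transmissions predicted by the application-driven predictor.
    for t in predicted:
        inject[(t.src, t.dst)] = try_admit(t, paths[(t.src, t.dst)], workload, threshold)
    # inject[(i, j)] is the control message sent to node i; a rejected predicted
    # transmission would be added to the delay table by Algorithm 2.
    return inject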

5.2 Data Aggregation

Figure 5.2 shows the basic idea of our proposed method. The control system is responsible for Algorithm 1 and each node is responsible for Algorithm 2.

Algorithm 1 Algorithm for the central control system

1:  // Initialization. inject[src][dest] is the control message that decides whether to inject.
2:  for all source-to-destination transmission pairs do
3:      inject[src][dest] = true;
4:  end for
5:  for all transmissions in the delay table do
6:      Select the transmission Tdelay_i,j with the highest priority;
7:      Let path be the routing path of Tdelay_i,j;
8:      if inject[i][j] == true then
9:          for all link ∈ path do
10:             if link.workload > threshold then
11:                 inject[i][j] = false;
12:                 break;
13:             end if
14:             // send the control message to the nodes
15:             if inject[i][j] == true then
16:                 Send an injection notification to node i to inject Tdelay_i,j;
17:                 Update link.workload;
18:                 Delete Tdelay_i,j from the delay table;
19:             end if
20:         end for
21:     end if
22: end for
23: Update the delay table priorities;
24: Collect the predicted transmissions from the application-driven predictor;
25: for all predicted transmissions do
26:     Select the transmission Tpredict_i,j with the highest priority;
27:     if inject[i][j] == true then
28:         for all link ∈ path do
29:             if link.workload > threshold then
30:                 inject[i][j] = false;
31:                 break;
32:             end if
33:             if inject[i][j] == true then
34:                 Update link.workload;
35:                 Delete Tpredict_i,j from the predicted transmissions;
36:             end if
37:         end for
38:     end if
39: end for

Algorithm 2 Algorithm for each node i

1:  Receive the control message;
2:  if there is a transmission to destination j then
3:      if inject[i][j] == true then
4:          Inject;
5:      else
6:          Add the transmission to the centralized delay table;
7:      end if
8:  end if
9:  Update the application-driven predictor;
10: Update link.workload;

The control system sends the control signals to each node via the control network, and each node sends information back via the control network to help the control system make its decisions. The nodes communicate with one another through the data network. In [10], the authors argue that the operating system is capable of managing network traffic. For this reason, our method can be adopted on the architecture platform of [10], with the control system realized by the operating system. However, this approach may be too heavyweight, so we propose an alternative: since there are many cores available, we can use a dedicated core to handle the flow control decisions. This dedicated core plays the role of the control system in Figure 5.2.

5.3 Area Occupancy

We now analyze the area overhead of the NTPT, i.e., the prediction tables described in Chapter 4, using transistor counts from real many-core designs: the UC Davis AsAP has 55M transistors, and Tilera's TILE64 has 615M transistors.

Figure 5.2: The diagram of flow control. The control system runs the application-driven traffic predictor; it receives update information from the cores and sends control signals to them via the control network, while the cores exchange data with one another over the data network.

Assuming that each bit of storage needs 6 transistors, the application-driven predictor in our design needs about 0.69M transistors when the number of cores is 64. In addition, we maintain another table, the control table, to record the delayed transmissions; assuming 128 entries, it needs about 0.02M transistors. Together they occupy about 1.29% and 0.12% of the transistor budgets of AsAP and TILE64, respectively, which is a small and tolerable area overhead. In contrast, [21] reports that increasing the data path width by 138% results in an area penalty of 64% in Xpipes, a NoC architecture; that overhead is considerable, yet the average packet latency only improves from 49 cycles to 39 cycles as the link bandwidth grows from 2.2 GB/s to 3.2 GB/s. In short, enlarging the link bandwidth improves the average packet latency only slightly at a huge area cost. This observation motivates injection-rate flow control, since increasing the link bandwidth is not economical.
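As a quick consistency check of the quoted percentages, using the transistor counts stated above:

$$0.69\,\text{M} + 0.02\,\text{M} = 0.71\,\text{M}, \qquad \frac{0.71\,\text{M}}{55\,\text{M}} \approx 1.29\%, \qquad \frac{0.71\,\text{M}}{615\,\text{M}} \approx 0.12\%.$$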


Chapter 6

Experimental Results

In this chapter, we present the experimental results used to evaluate our proposed flow control algorithm. We adopt both real application traffic and synthetic traffic in our experiments.

6.1 Simulation Setup

The PoPNet network simulator [22] is used for our simulations, with data transmission traces as its input. Each trace records the packet injection time, the address of the source router, the address of the destination router, and the packet size. The detailed simulation configuration is provided in Table 6.1. The original data transmission traces are altered by our flow control algorithm, so that some transmissions are delayed for some period in order to avoid congestion. The experimental results presented below show that our algorithm yields a large performance improvement.
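For illustration only, the sketch below shows one way such a trace record and the delaying transformation could be represented in Python; the field names, the 250-cycle interval in the example, and the per-pair delay map are assumptions, not the actual PoPNet trace format or our control logic.

from dataclasses import dataclass, replace

@dataclass
class TraceRecord:
    inject_cycle: int   # packet injection time
    src: int            # address of the source router
    dst: int            # address of the destination router
    size: int           # packet size

def apply_delays(trace, delay_by_pair):
    # Shift the injection time of delayed (src, dst) pairs by the given number of cycles.
    return [replace(r, inject_cycle=r.inject_cycle + delay_by_pair.get((r.src, r.dst), 0))
            for r in trace]

# Example: delay every packet from router 6 to router 4 by one 250-cycle interval.
trace = [TraceRecord(10, 6, 4, 256), TraceRecord(12, 5, 4, 128)]
print(apply_delays(trace, {(6, 4): 250}))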


Table 6.1: Simulation Configuration

Network topology     4 × 4 mesh
Virtual channels     3
Buffer size          12
Routing algorithm    X-Y routing
Bandwidth            32 bytes

6.2 Real Application Traffic

The Tilera’s TILE64 platform is used to run the benchmark programs and collect

the data transmission traces. We use SPLASH-2 blocked LU decomposition

as our benchmark program. The total workload is 3991 packets. As shown

in Table 6.2, the average packet latency drops from 2410.79 cycles to 771.858

cycles and the maximum packet latency drops from 5332 cycles to 3242 cycles.

The significant performance improvement origins from that we predict traffic

workload in the next interval and delay some packet injection to avoid congestion.

As depicted in Figure 6.1 (a), the packet latencies without flow control range

between 0 cycles and 5500 cycles. However, with our proposed flow control

algorithm, the packet latencies range between 0 cycles and 3300 cycles. These

packet latencies have decreased violently so that the histogram shifts to the left

side. To bear up our conviction, Figure 6.2 demonstrates more details about

                     Original          Pattern-oriented   Reduction (ratio)
Ave. latency         2410.79 cycles    771.858 cycles     3.12
Max. latency         5332 cycles       3242 cycles        1.64
Simulation cycles    5600 cycles       6100 cycles        0.92

Table 6.2: Our proposed flow control algorithm leads to a large reduction in latency with only a slight execution time overhead. The reduction column is the ratio of the original value to the pattern-oriented value.

We set the congestion threshold at 40 flits. The line in Figure 6.2(b) occasionally goes above the threshold because of wrong predictions of the network traffic; however, the impact of these mispredictions is slight, so the result remains within an acceptable range. In Figure 6.2(a), without flow control, the maximum workload is far above the threshold and consequently causes severe network congestion.

6.3 Synthetic Traffic

Besides the real application traffic, we also apply our algorithm to synthetic traffic. In [20], the authors state that injected network traffic possesses self-similar temporal properties, and they use a single parameter, the Hurst exponent H, to capture the temporal burstiness of NoC traffic. Based on this traffic model, we synthesize our traffic traces.

Figure 6.1: Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically. (In both histograms the horizontal axis is the packet latency in cycles and the vertical axis is the number of packets.)

Figure 6.2: The maximum workload of the links in the network over time without (a, "original") and with (b, "pattern-oriented") the proposed flow control. (In both plots the horizontal axis is time in cycles and the vertical axis is the maximum link workload in flits; the vertical axis of (a) extends to 4500 flits, while that of (b) extends only to 70 flits.)

In Table 6.3, we give results for several values of the parameter H and make comparisons. These values are chosen from Table 1 of [20], selecting a few of them for convenience. Both the average packet latency and the maximum latency drop significantly. In addition, the execution time with our proposed flow control is slightly better than that without flow control. Relatively large values of H indicate highly self-similar traffic and a higher traffic prediction accuracy; but because the average packet size also increases with H, the reduction does not grow linearly with H.

H                            0.576     0.661     0.768     0.855     0.978
Original ave. latency        3553.14   3596.45   3649.21   3665.53   3614.56
Improved ave. latency        482.512   467.787   387.716   412.983   417.577
Reduction of ave. latency    7.364     7.688     9.412     8.876     8.656
Original max. latency        7623      7623      7710      7658      7714
Improved max. latency        1591      1532      1016      1054      1037
Reduction of max. latency    4.791     4.976     7.589     7.266     7.438
Original simulation cycles   8580      8510      8550      8480      8450
Improved simulation cycles   8280      8260      7690      7781      7731

(All latencies and simulation cycles are in cycles; the reduction rows are the ratios of the original values to the improved values.)

Table 6.3: For synthetic traffic, our proposed flow control algorithm leads to a large reduction in both the average and the maximum latency and a slight reduction in the execution time.

Chapter 7

Conclusion and Future Work

This thesis proposes an application-oriented flow control scheme for packet-switched networks-on-chip. By tracking and predicting the end-to-end transmission behavior of the running applications, we can limit traffic injection when the network is heavily loaded. By delaying some transmissions judiciously, the average packet latency can be decreased significantly and the performance improved noticeably. In our experiments, we adopt both real application traffic traces and synthetic traffic traces. The experimental results show that our proposed flow control not only decreases the average and maximum packet latency, but under some conditions even shortens the execution time.

Future work will focus on improving the accuracy of the application-oriented traffic prediction. The simulation configuration should also be explored further, and determining the optimal parameters and tuning the flow control algorithm are likewise important. In addition, this work ignores the communication dependencies between the traffic traces, since taking them into account is difficult and is left for future study.

Bibliography

[1] Y. S.-C. Huang, C.-K. Chou, C.-T. King, and S.-Y. Tseng, "NTPT: On the end-to-end traffic prediction in the on-chip networks", in Proc. 47th ACM/IEEE Design Automation Conference, 2010.

[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "TILE64 processor: A 64-core SoC with mesh interconnect", in Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC 2008), Feb. 3–7, 2008, pp. 88–598.

[3] Jose Duato, Sudhakar Yalamanchili, and Lionel Ni, Interconnection Networks, 2002, pp. 428–431.

[4] S. Mascolo, "Classical control theory for congestion avoidance in high-speed internet", in Proc. Decision and Control Conference, 1999.

[5] Cui-Qing Yang, "A taxonomy for congestion control algorithms in packet switching networks", IEEE Network, 1995.

[6] Hua Yongru Gu, Wang Hua O., and Hong Yiguang, "A predictive congestion control algorithm for high speed communication networks", in Proc. American Control Conference, 2001.

[7] Erland Nilsson, Mikael Millberg, Johnny Oberg, and Axel Jantsch, "Load distribution with the proximity congestion awareness in a network on chip", in Proc. Design, Automation, and Test in Europe, 2003, p. 11126.

[8] U. Y. Ogras and R. Marculescu, "Prediction-based flow control for network-on-chip traffic", in Proc. 43rd ACM/IEEE Design Automation Conference, 2006, pp. 839–844.

[9] U. Y. Ogras and R. Marculescu, "Analysis and optimization of prediction-based flow control in networks-on-chip", ACM Transactions on Design Automation of Electronic Systems, 2008.

[10] Vincent Nollet, Theodore Marescaux, and Diederik Verkest, "Operating-system controlled network on chip", in Proc. 41st ACM/IEEE Design Automation Conference, 2004.

[11] P. Avasare, J-Y. Nollet, D. Verkest, and H. Corporaal, "Centralized end-to-end flow control in a best-effort network-on-chip", in Proc. 5th ACM International Conference on Embedded Software, 2005.

[12] Mohammad S. Talebi, Fahimeh Jafari, and Ahmad Khonsari, "A novel congestion control scheme for elastic flows in network-on-chip based on sum-rate optimization", in ICCSA, 2007.

[13] M. S. Talebi, F. Jafari, and A. Khonsari, "A novel flow control scheme for best effort traffic in NoC based on source rate utility maximization", in MASCOTS, 2007.

[14] Mohammad S. Talebi, Fahimeh Jafari, Ahmad Khonsari, and Mohammad H. Yaghmaee, "Best effort flow control in network-on-chip", in CSICC, 2008.

[15] Fahimeh Jafari, Mohammad S. Talebi, Mohammad H. Yaghmaee, Ahmad Khonsari, and Mohamed Ould-Khaoua, "Throughput-fairness tradeoff in best effort flow control for on-chip architectures", in Proc. 2009 IEEE International Symposium on Parallel and Distributed Processing, 2009.

[16] T. Marescaux, A. Rangevall, V. Nollet, A. Bartic, and H. Corporaal, "Distributed congestion control for packet switched networks on chip", in ParCo, 2005.

[17] J. W. van den Brand, C. Ciordas, K. Goossens, and T. Basten, "Congestion-controlled best-effort communication for networks-on-chip", in Proc. Design, Automation, and Test in Europe, 2007.

[18] Jin Yuho, Yum Ki Hwan, and Kim Eun Jung, "Adaptive data compression for high-performance low-power on-chip networks", in Proc. 41st Annual IEEE/ACM International Symposium on Microarchitecture, 2008.

[19] Srinivasan Keshav, "Congestion control in computer networks", 1991.

[20] Vassos Soteriou, Hangsheng Wang, and Li-Shiuan Peh, "A statistical traffic model for on-chip interconnection networks", in Proc. 14th IEEE International Symposium on Modeling, Analysis, and Simulation, 2006.

[21] Anthony Leroy, "Optimizing the on-chip communication architecture of low power systems-on-chip in deep sub-micron technology", 2006.

[22] N. Agarwal, T. Krishna, L. Peh, and N. Jha, "GARNET: A detailed on-chip network model inside a full-system simulator", in Proc. International Symposium on Performance Analysis of Systems and Software, 2009.