Application Behavior-aware Flow Control in Network-on-Chip
Advisor: Chung-Ta King
Student: Huan-Yu Liu
Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan 30013, R.O.C.
July 2010
Abstract
Multicore may be the only solution to the performance and power issues of future chip processor architectures. As the number of cores on a chip keeps increasing, traditional bus-based architectures are incapable of offering the required on-chip communication bandwidth, so the Network-on-Chip (NoC) has become the main paradigm for on-chip interconnection. NoCs not only offer significant bandwidth advantages but also provide outstanding flexibility. However, the performance of an NoC can degrade significantly if the network flow is not controlled properly. Most previous solutions try to detect network congestion by monitoring the hardware status of the network switches or links. A change of hardware status at a local end may indicate possible congestion in the network, and packet injection into the network is then throttled in reaction. The problem with these solutions is that congestion detection is based only on local status without global information. Actual congestion may occur elsewhere and can only be detected through backpressure, which may be too passive and too slow for taking reactive measures in time.
This work takes a proactive approach to congestion detection. The idea is to predict changes in the global, end-to-end network traffic patterns of the running application and take proactive flow control actions to avoid possible congestion. Traffic prediction is based on our recent paper [1], which uses a table-driven predictor for predicting application communication patterns. In this thesis, we discuss how to use the prediction results for effective scheduling of packet injection to avoid network congestion and improve throughput. The proposed scheme is evaluated by simulation with a SPLASH-2 benchmark as well as synthetic traffic. The results show superior performance improvement and negligible execution overhead.
Abstract (in Chinese)
When considering the performance and power issues of future chip processor architectures, multicore may be the only solution. As the number of cores on a chip keeps increasing, traditional bus-based architectures can no longer satisfy the transmission bandwidth required on the chip, and the Network-on-Chip (NoC) has become the mainstream for on-chip interconnection. NoCs not only offer the advantage of considerable bandwidth but also exhibit outstanding flexibility. However, if the network flow is not properly controlled, network performance degrades greatly. Most previous solutions try to detect network congestion by monitoring the hardware status of the switches and links in the network. Changes in these local hardware statuses may indicate congestion about to occur in the network, and packet injection is then controlled according to the congestion status. These detection methods look only at local hardware status without considering the situation of the whole network; in fact, congestion does not only occur locally but may occur elsewhere in the network. Moreover, detecting congestion through hardware status is a backpressure mechanism, which is too passive and too slow to react to real network congestion in time.
This thesis adopts a more proactive method to detect network congestion. The idea is to predict the network traffic patterns of the running application from a global, end-to-end transmission perspective, and to perform flow control based on this prediction to avoid congestion. The traffic prediction is based on a recent paper, whose approach uses tables to record application transmission patterns for prediction. In this thesis, we discuss how to use the prediction results for effective scheduling of packet injection, so as to avoid network congestion and improve overall throughput. The proposed scheme is evaluated by simulation with SPLASH-2, and experiments are also conducted with synthetic traffic.
Acknowledgements
If you asked me whether my two years of Master's study or my four years of undergraduate life were more fulfilling, I would choose the former without hesitation. Undergraduate life was mostly about attending classes, studying, and taking exams. Graduate school was different: besides coursework and exams, I was busy working on projects and searching for a thesis topic. Finding a topic was truly a bumpy road; I changed it again and again without finding a suitable one. When I finally settled on a topic, I lived under the fear and pressure of not finishing on time. In the end, I completed the work and produced results. Although the research process was full of frustration, I also learned a great deal, both academically and mentally.
The first person I want to thank is my advisor, Professor Chung-Ta King, who gave my thesis many valuable suggestions, pointed out my blind spots, and kept my research moving forward. I am deeply grateful to him.
I also want to thank everyone in the Multi-core group, especially my seniors 有希 and 布拉胖, who gave my thesis much substantive help; my senior in particular patiently discussed the work with me many times and was a major force behind the completion of this thesis. Thanks also to kaven, whose ability I can only admire from afar; I often discussed my thesis with you and learned a lot, and your banter brought us much joy. The juniors 雙翔 are quite a pair, and they convinced me that our Multi-core group is truly made up of remarkable characters.
Thanks to every PhD senior, peer, and junior student in PADS. Although our research fields differ, PADS was always full of joy because of you, and even under the pressure of graduating my mood stayed cheerful. Thank you all.
Contents
1 Introduction 1
2 Motivating Example 6
3 Related Work 10
4 Problem Formulation 12
4.1 Application-Driven Predictor . . . . . . . . . . . . . . . . . . . . . . 14
5 Traffic Control Algorithm 19
5.1 Traffic Control Algorithm and Implementation Overhead . . . . . . 19
5.2 Data Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5.3 Area Occupancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
6 Experimental Results 25
6.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 Real Application Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.3 Synthetic Traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7 Conclusion and Future Work 32
List of Tables
6.1 Simulation Configuration . . . . . . . . . . . . . . . . . . . . . . . . . 26
6.2 Our proposed flow control algorithm leads to a large reduction in latency with a slight execution-time overhead. . . . . . . . . . . 27
6.3 For synthetic traffic, our proposed flow control algorithm leads to a large reduction in the average and maximum latency and a slight reduction in execution time. . . . . . . . . . . . . . . . . . . 31
List of Figures
2.1 The tile arrangement and interconnection topology used for the experiment on the TILE64 platform . . . . . . . . . . . . . . . . . 7
2.2 The traffic of router 4 is tracked. The first diagram shows all the traffic input/output from router 4. The second to the fourth diagrams show the decomposed traffic. Note that the traffic relayed by router 4 is omitted. The last one is the output traffic from router 4 to 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4.1 The structure of a router . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2 An example of an L1-table. The columns G4 : G0 record the quantized transmitted data size of the last 5 time intervals. . . . . 14
4.3 An example of an L2-table, which is indexed by the transmission history pattern G4 : G0. The corresponding data size level Gp is the value predicted to be transmitted in the next time interval . . 15
4.4 A table which records the delayed transmissions . . . . . . . . . . . 15
5.1 The diagram of the flow control algorithm . . . . . . . . . . . . . . . 21
5.2 The diagram of flow control . . . . . . . . . . . . . . . . . . . . . . . 24
6.1 Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically. . . . . 28
6.2 The maximum workload of links in the network without (a) and
with (b) the proposed flow control. . . . . . . . . . . . . . . . . . . . 29
Chapter 1
Introduction
The number of transistors on a chip has increased exponentially over the decades according to Moore's Law. At the same time, applications have also increased in complexity and therefore require huge amounts of computation. These factors are further coupled with the increasing need for power saving as the clock frequency of a core increases. The best current practice is to adopt a multicore architecture and parallelize applications. However, communication overhead becomes a critical bottleneck if we cannot offer substantial bandwidth among the cores. Traditional bus-based architectures suffer from increased packet latencies as the number of on-chip cores increases, and they are incapable of providing performance guarantees, especially for real-time applications. As a result, the Network-on-Chip (NoC) has become the de facto solution to this critical problem.
NoCs not only offer significant bandwidth but also provide outstanding flexibility and scalability. There are already multi- and many-core processors on the market that adopt an NoC as their communication fabric. For example, Tilera's TILE64 [2], introduced in 2007, uses a 2-D mesh-based network to interconnect 64 tiles and 4 memory controllers. Indeed, NoCs are becoming the main communication and design fabric for chip-level multiprocessors.
Since the cores are connected by a network, flow control and congestion control in NoCs are clearly important issues. If a core transmits too many packets to another core, the intermediate routers must buffer many packet flits, congesting the network. Without an effective flow control mechanism, the performance of an NoC may degrade sharply due to congestion. According to [3], the accepted traffic increases linearly with the applied load until a saturation point is reached; beyond the saturation point, the accepted traffic decreases considerably.
There are already many solutions to congestion in off-chip networks [4–6]. However, most of them are not suitable for on-chip networks. In off-chip environments, dropping packets is commonly used as a means of flow control when congestion happens, which requires the environment to provide an acknowledgment mechanism. On-chip networks, in contrast, possess reliable on-chip wires and more effective link-level flow control, which make them almost lossless. As a result, there is no need to implement complicated protocols, such as acknowledgments, only for flow control. This difference gives us the chance to come up with a new solution.
To the best of our knowledge, few research works discuss the congestion control problem in NoCs. In [7], switches exchange their load information with neighboring switches to avoid hot spots through which most packets would pass. In [8, 9], a predictive closed-loop flow control mechanism is proposed based on a router model, which predicts how many flits the router can accept in the next k time steps. However, it ignores the flits injected by neighboring routers during the prediction period. In [10, 11], a centralized, end-to-end flow control mechanism is proposed. However, it requires a dedicated network called the control NoC to transfer OS-control messages, and it relies only on locally blocked messages to decide when a processing element may send messages into the network.
Most of the works mentioned above detect network congestion by monitoring hardware status, such as buffer occupancy, link utilization, and the number of blocked messages. However, such statuses are bounded by hardware limitations. For example, the size and number of buffers are limited, so without adding new hardware the detection may be very inaccurate. In particular, if a bursty workload exceeds the hardware limits, the congestion may not be detected immediately. In addition, congestion detection based on hardware status is a reactive technique: it relies on backpressure, so the traffic sources cannot throttle their injection rates before the network becomes severely congested. Furthermore, previous work on NoC flow control does not take global information into consideration when making flow control decisions. Even if a certain core determines that the network is free of congestion and decides to inject packets, some links or buffers at other cores may still be congested, and the injection causes even more severe congestion.
In this thesis, we propose a proactive congestion and flow control mechanism. The core idea is to predict the future global traffic in the NoC according to the data transmission behaviors of the running applications. Based on the prediction, we can control network injection before congestion occurs. Notice that most applications show repetitive communication patterns because they tend to execute similar code within a time interval, such as a loop in the program. These patterns may reflect the network state more accurately, since applications are the sources of the traffic in the network. Once the application patterns can be predicted accurately, the future traffic of every link can be estimated from this information, and the injection rate of each node can be controlled before the network becomes congested. However, predicting network traffic with high accuracy is a challenge. In this thesis, the data transmission behavior of the running application is tracked and used as the clue for predicting future traffic with a specialized table-driven predictor. This technique is inspired by the branch predictor and works well for the end-to-end traffic of the network [1].
The main contributions of this thesis are as follows. First, we predict congestion according to the data transmission behaviors of applications rather than hardware statuses, since the data transmissions of applications are the direct source of NoC congestion. Second, we modify the table-driven predictor proposed in [1] to not only capture and predict the data transmission behaviors of the application at run time but also make injection rate control decisions. Third, we present the implementation details of the traffic control algorithm. By taking advantage of the many-core architecture, we can dedicate a core to making packet injection decisions and achieving global performance.
This thesis is organized as follows. In Chapter 2, a motivating example is given
to show the repetitive data transmission behavior in applications. In Chapter 3,
related works are discussed. Next, we give a formal definition of the flow control
problem in Chapter 4. In Chapter 5, we present the details of the traffic control
algorithm. Evaluations are shown in Chapter 6. Finally, conclusions are given
in Chapter 7.
Chapter 2
Motivating Example
In this chapter, we show that data transmission in parallel programs exhibits repetitive patterns, taking the LU decomposition kernel of the SPLASH-2 benchmark as an example. The LU decomposition kernel is ported to the TILE64 platform and run on a 4 × 4 tile array, as Figure 2.1 shows. The detailed experimental setup is described in Chapter 6. We used 16 tiles for porting the application, and the routing algorithm is X-Y dimension-ordered routing. In the following discussion, we use the form (source, destination) to describe transmission pairs.
Figure 2.2 shows the transmission trace of router 4. In the first diagram, the traffic is mixed from the viewpoint of the East port. The mixed traffic is somewhat messy and hard to predict. In previous works, traffic prediction is made mainly by checking hardware status, such as buffer occupancy, link utilization, and so on. The hardware status is affected by the mixed traffic as the
[Figure 2.1 shows a 4 × 4 grid of tiles numbered 0-15, laid out column by column: the rows read 0, 4, 8, 12; 1, 5, 9, 13; 2, 6, 10, 14; and 3, 7, 11, 15.]
Figure 2.1: The tile arrangement and interconnection topology used for the experiment on the TILE64 platform
first diagram shows. Such irregular traffic makes hardware status unsuitable for predicting the network workload.
However, when we extract the traffic of the pairs (5,4), (6,4), and (7,4), as the second to fourth diagrams show (the last diagram shows the output traffic (4,5)), the flows are more regular and predictable. The separated transmission trace is recorded from the viewpoint of end-to-end data transmission as issued by the running application. The end-to-end data transmission exhibits repetitive patterns, since the application executes similar operations across time intervals.
By exploiting this repetitive characteristic of application execution, we can predict the end-to-end data transmission accurately by recording its history. The workload prediction for a given link in the network can then be derived by summing all the predicted end-to-end data transmissions that pass through the link. Since we can predict the NoC traffic in the next time interval, we can control the sources of the traffic and regulate them ahead of packet injection, thereby realizing congestion avoidance.
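The per-link aggregation described above can be sketched as follows. This is an illustrative sketch, not code from the thesis: the mesh numbering, the function names, and the byte counts in the flow table are all assumptions (the numbering below places routers 4-7 along one row; Figure 2.1's layout is the transpose, but the links traversed between routers 4-7 are the same either way).

```python
# Sketch: derive per-link workload predictions by summing predicted
# end-to-end transmissions over their X-Y routing paths on a 4x4 mesh.

MESH_W = 4  # 4x4 tile array, illustrative row-major numbering 0..15


def xy_route(src, dst):
    """Return the list of directed links (a, b) visited by X-Y routing."""
    sx, sy = src % MESH_W, src // MESH_W
    dx, dy = dst % MESH_W, dst // MESH_W
    path, cur = [], src
    while sx != dx:                      # route along X first
        sx += 1 if dx > sx else -1
        nxt = sy * MESH_W + sx
        path.append((cur, nxt))
        cur = nxt
    while sy != dy:                      # then along Y
        sy += 1 if dy > sy else -1
        nxt = sy * MESH_W + sx
        path.append((cur, nxt))
        cur = nxt
    return path


def predict_link_workload(predicted_flows):
    """predicted_flows: {(src, dst): predicted bytes in the next interval}."""
    load = {}
    for (src, dst), size in predicted_flows.items():
        for link in xy_route(src, dst):
            load[link] = load.get(link, 0) + size
    return load


# The three decomposed flows of Figure 2.2 (byte counts invented):
load = predict_link_workload({(5, 4): 4000, (6, 4): 2000, (7, 4): 1000})
```

Every flow destined for router 4 from its row neighbors crosses the final link (5, 4), so that link's predicted workload is the sum of all three flows; this is exactly the quantity compared against the congestion threshold later in the thesis.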
[Figure 2.2 panels: East input/output of router 4; the decomposed flows 5 to 4, 6 to 4, and 7 to 4; and the output flow 4 to 5. The vertical axes show data size per time interval.]
Figure 2.2: The traffic of router 4 is tracked. The first diagram shows all the traffic input/output from router 4. The second to the fourth diagrams show the decomposed traffic. Note that the traffic relayed by router 4 is omitted. The last one is the output traffic from router 4 to 5.
Chapter 3
Related Work
In [7], information about a switch is sent to other switches for deciding the routing path so as to avoid congestion. The control information is sent locally and cannot reflect the status of the whole network. In [8, 9], the authors predict network congestion based on their proposed traffic source and router model. Using this model, each router predicts the availability of its buffers ahead of time, i.e., how many flits the router can currently accept. The traffic source cannot inject packets until the availability is greater than zero. They predict traffic from the switch perspective, whereas our predictions are made from the perspective of the application.
In [12–15], congestion control is modeled as a utility maximization problem, and these works propose iterative algorithms as solutions to the maximization problem.
The authors of [10] make use of the operating system (OS) and let the system software control resource usage. In [11], the authors detail an NoC communication management scheme based on a centralized, end-to-end flow control mechanism that monitors hardware statuses. All the works above need a dedicated control NoC to transfer OS-control messages and a data NoC responsible for delivering data packets. The OS refers to the blocked messages of the local processing element to limit the times at which the element may send messages. In [16], almost the same network architecture is assumed, except that some extra hardware is added to support a distributed HW/SW congestion control technique.
Model Predictive Control (MPC) is used for on-chip congestion control in [17]. In that work, the link utilization of a router is used as the congestion measure. In contrast, our work makes predictions at the application layer rather than the link layer in order to capture the transmission behaviors of the running applications. We argue that these behaviors are the root cause of network congestion.
Chapter 4
Problem Formulation
We have seen that congestion can degrade network performance considerably, so congestion should be avoided as much as possible. In [18], queueing delay is used as one metric for congestion detection. In [17], the authors use link utilization as the congestion measure. Since there is no universally accepted definition of network congestion [19], we take link utilization as the congestion measure in this thesis. The utilization of a link ei in the t-th time interval is defined as:
Utili(t) = Di(t) / (T × W),   0 ≤ Utili(t) ≤ 1
where Di(t) denotes the total data size transmitted by ei at the t-th time interval.
The period of a time interval is defined as T seconds and W is the maximum
bandwidth of a communication link. Thus T ×W denotes the maximum possible
data size transmitted in one time interval.
We assume that if the utilization of a given link exceeds a properly selected threshold Th, the link is congested. Experimental results in [17] assert that 80% link utilization results in reasonable latencies before the congestion limit. However, the threshold value should take hardware configuration into consideration, such as the buffer sizes and the link bandwidth.
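The utilization measure and threshold test follow directly from the definition above. In this sketch the interval length T, bandwidth W, and data sizes in the example are invented for illustration; only the 0.8 threshold comes from the figure cited from [17].

```python
def link_utilization(data_bytes, interval_s, bandwidth_bps):
    """Util_i(t) = D_i(t) / (T * W): the fraction of the link's capacity
    T * W (bytes per interval) consumed during interval t."""
    return data_bytes / (interval_s * bandwidth_bps)


def is_congested(data_bytes, interval_s, bandwidth_bps, threshold=0.8):
    # Th = 0.8 follows the 80% link-utilization figure cited from [17];
    # in practice it should be tuned to the buffer sizes and link bandwidth.
    return link_utilization(data_bytes, interval_s, bandwidth_bps) > threshold


# Example (made-up numbers): a 1 us interval on a 16 GB/s link can carry
# 16000 bytes; 12800 bytes is exactly 80% utilization, the congestion edge.
util = link_utilization(12800, 1e-6, 16e9)
```

Utilization at exactly the threshold is not flagged; only links strictly above Th are treated as congested, matching the "exceeds a threshold" wording in the text.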
We hope to prevent the network from becoming congested before congestion happens. This is achieved by predicting the possible traffic in the t-th time interval and preventing several traffic sources from injecting packets concurrently. By scheduling packet injection effectively, we can avoid network congestion and thereby improve the average packet latency. Latency is a commonly used performance metric and can be interpreted in different ways [3]. We define latency here as the time elapsed from when the message header is injected into the network at the source node until the tail of the packet is received at the destination node.
Assume that λ is the average packet latency and texec is the total execution time without any flow control, and that λ′ and t′exec are the corresponding values with our proposed flow control. Our goal is to maximize λ − λ′ and texec − t′exec. However, the execution time is affected by the communication dependencies between traffic sources [20]; a full treatment of program dependencies is beyond the scope of this thesis.
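As a small illustration of this latency metric, the averages can be computed from per-packet (header-injection, tail-reception) timestamps. The timestamp values below are invented purely for the example; they are not measurements from the thesis.

```python
def average_latency(records):
    """records: list of (t_header_injected, t_tail_received) per packet,
    matching the latency definition used in this chapter."""
    return sum(recv - inj for inj, recv in records) / len(records)


# Hypothetical traces for the same three packets, in cycles:
baseline = average_latency([(0, 40), (5, 60), (9, 53)])    # no flow control
controlled = average_latency([(0, 22), (5, 31), (9, 27)])  # with flow control
improvement = baseline - controlled                         # the lambda - lambda' objective
```

A positive `improvement` corresponds to the λ − λ′ quantity the flow control aims to maximize.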
[Figure 4.1 shows a router built around a 5 × 5 crossbar: inputs 0-4 and outputs 0-4 connect to the local processor and to the routers to the north, west, south, and east.]
Figure 4.1: The structure of a router
Dest.  LRU  Data Size | G4 G3 G2 G1 G0   (transmission history)
  5     0     256     |  5  3  1  2  4
  8     2     128     |  3  3  0  3  3
 10     1     512     |  2  2  2  2  2
 13     3      64     |  5  4  3  5  4
Figure 4.2: An example of an L1-table. The columns G4 : G0 record the quantized transmitted data size of the last 5 time intervals.
4.1 Application-Driven Predictor
In this section, we show how to predict traffic using a table-driven network traffic predictor and how to make traffic control decisions with an extra table that records delayed transmissions. The original prediction method was proposed in [1]; however, that work only discusses how to monitor and predict the traffic without interfering with it. In this thesis, the future transmissions
G4 G3 G2 G1 G0   LRU  Gp   (indexed by the L1-table history)
 5  3  1  2  4    31   2
 4  4  0  4  4    13   4
 5  4  2  5  3     5   0
 3  1  2  6  3    12   2
 …
Figure 4.3: An example of an L2-table, which is indexed by the transmission history pattern G4 : G0. The corresponding data size level Gp is the value predicted to be transmitted in the next time interval.
Src.  Dest.  Data size  Priority
  9    10      256         3
  4     3       64         2
  3    12       32         0
  5     6       16         0
  …
Figure 4.4: A table recording the delayed transmissions
are tightly controlled by our extended design. To simplify the following discussion, we assume a 2D mesh of size N × N as the underlying topology. Note that our approach is independent of the topology and size of the network, so it can easily be extended to other topologies and arbitrary network sizes. Each tile consists of a processor core, a memory module, and a router. We assume that the router has 5 input ports, 5 output ports, and a 5 × 5 crossbar; its structure is shown in Figure 4.1. The crossbar serves five connections: east, north, west, south, and the local processor. Each connection consists of two unidirectional communication links, for sending and receiving data respectively. A deterministic routing algorithm is assumed, so the path between a source and a destination is determined in advance; this is the most common type of routing algorithm in current NoC implementations.
A table-driven predictor is employed to record the traffic history, which is then used to predict the data size and destination of the outgoing traffic from each router in the next time interval. Each router maintains two hierarchical tables for tracking and predicting data transmissions. The first-level table (L1-table), shown in Figure 4.2, tracks all output data transmissions. Each router uses only four entries to record transmission destinations, since a core may communicate with only a subset of all the cores [1]. Destination entries are replaced under the LRU replacement policy to bound the size of the table. To map history patterns to a guess about the following transmission, a second-level table (L2-table) is required. At the beginning of the t-th time interval, the transmission history recorded in the L1-table is used to index the L2-table and obtain the predicted level of the transmission data size for the t-th interval. When an output transmission is issued by the processor core during the interval, its destination and data size are recorded in the L1-table; the data size is quantized and recorded in G0. The columns G0 to Gn record the quantized transmitted data sizes of the last n + 1 time intervals. The two tables are updated at the end of each predefined time interval: after checking the prediction, the value of the data size counter in the L1-table is quantized and shifted into G0. Finally, the updated transmission history in the L1-table is used to index the L2-table and retrieve the predicted data size level for the next time interval. If the transmission history cannot be found in the L2-table, the system either creates a new entry or replaces an existing entry by LRU, and uses the last value (G0) as the predicted data size level. The recorded data size levels in the L1-table are used to check the accuracy of the prediction made in the last time interval; if the prediction was wrong, the value of Gp in the L2-table for the corresponding history pattern is corrected to the data size level recorded in the L1-table.
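The two-level lookup-and-train cycle described above can be sketched as follows. This is a sketch of the scheme from [1] as described here, not its implementation: the quantization granularity, history depth, and table capacity are illustrative assumptions the thesis does not fix at this point.

```python
from collections import OrderedDict

HISTORY = 5        # levels G4..G0, as in Figures 4.2 and 4.3
L2_CAPACITY = 64   # assumed L2-table size, managed by LRU replacement


def quantize(size):
    """Map a transmitted data size to a small level (assumed 64-byte,
    power-of-two bins; the thesis does not specify the quantizer here)."""
    level = 0
    while size > (64 << level) and level < 7:
        level += 1
    return level


class Predictor:
    def __init__(self):
        self.history = {}         # dest -> last HISTORY quantized levels
        self.l2 = OrderedDict()   # history tuple -> predicted level Gp

    def predict(self, dest):
        """Start of interval: index the L2-table with the L1 history."""
        hist = tuple(self.history.get(dest, [0] * HISTORY))
        if hist in self.l2:
            self.l2.move_to_end(hist)          # refresh LRU position
            return self.l2[hist]
        self.l2[hist] = hist[-1]               # miss: fall back to G0
        if len(self.l2) > L2_CAPACITY:
            self.l2.popitem(last=False)        # evict least recently used
        return hist[-1]

    def update(self, dest, actual_size):
        """End of interval: correct Gp on misprediction, shift history."""
        hist = tuple(self.history.get(dest, [0] * HISTORY))
        actual = quantize(actual_size)
        if self.l2.get(hist) != actual:
            self.l2[hist] = actual             # train toward the observed level
        self.history[dest] = list(hist[1:]) + [actual]


# Usage: a repetitive 256/128-byte pattern toward destination 4 is learned
# after a few intervals, so the next-interval prediction matches the pattern.
p = Predictor()
for size in [256, 128] * 5:
    p.update(4, size)
```

After training on the alternating pattern, `p.predict(4)` returns the level of the next expected transfer, which is how the per-destination predictions feed the link workload estimates of Chapter 2.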
Besides the traffic predictor, we maintain another table to record delayed transmissions, as shown in Figure 4.4. When the traffic control algorithm decides to delay a transmission, we record its source, destination, and traffic size. To avoid starvation, a priority column is added: each time a transmission is delayed for another interval, its priority value is increased.
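A minimal sketch of this delay table and its priority aging, with hypothetical helper names (the thesis specifies only the columns, not an API):

```python
def delay(table, src, dest, size):
    """Record a postponed transmission with the columns of Figure 4.4."""
    table.append({"src": src, "dest": dest, "size": size, "priority": 0})


def age_priorities(table):
    """Called once per interval: every still-delayed entry gains priority,
    so the longest-delayed transmission is served first (no starvation)."""
    for entry in table:
        entry["priority"] += 1


def pick_highest_priority(table):
    return max(table, key=lambda e: e["priority"]) if table else None


# Usage: an entry delayed for one extra interval outranks a fresh one.
table = []
delay(table, 9, 10, 256)
age_priorities(table)
delay(table, 4, 3, 64)
best = pick_highest_priority(table)
```

The aging step is what Chapter 5's control algorithm invokes after each scheduling pass over the delay table.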
Chapter 5
Traffic Control Algorithm
In this chapter, we present a heuristic algorithm for NoC traffic management and then describe some possible ways to aggregate the prediction data.
5.1 Traffic Control Algorithm and Implementation Overhead
The algorithm detailed in Algorithm 1 runs on a central control system, and the algorithm detailed in Algorithm 2 runs on each node. The control system maintains two tables: one records the transmissions that have been delayed, and the other records the transmissions that are predicted. Given these transmissions, the control system decides which should be delayed and which should be injected. A control message inject is sent from the control system to each node to indicate whether source node i may inject into destination node j in the next time interval. Note that Algorithm 1 is executed at the beginning of each time interval, while Algorithm 2 is executed during the interval.
Because this flow control operates at the end-to-end layer, we use inject to indicate whether source i can send packets to destination j. Figure 5.1 gives a simple flow chart of the flow control algorithm.
At the beginning, we assume that each source can send traffic to each destination (line 3). The algorithm then decides which transmissions in the delay table may inject (lines 5-22). Each transmission has its own priority to avoid starvation (line 6); the transmission with the highest priority is the one that has been delayed the longest. The workload (line 10) includes both the workload not yet finished processing and the workload that may be injected in the next time interval. If the workload of any link on the path exceeds the threshold value, the control signal is set to false (line 11); the threshold value depends on the architecture. After deciding which transmissions in the delay table should inject, the remaining transmissions update their priorities (line 23). The control system then collects the transmissions predicted by the predictor for the next time interval and decides whether each control signal should be true or false.
Algorithm 2 is executed in each source node during a time interval. Every source node receives the control message from the control system and makes decisions (line 1). When there is a transmission from source i to destination j,
Figure 5.1: The diagram of the flow control algorithm
if the control message value is true, the source node is allowed to inject traffic into the network; otherwise, the source node must not inject any traffic and instead adds the transmission to the centralized delay table.
It is worth mentioning that the algorithms presented here are just one example of flow control given the ability to predict NoC traffic; other algorithms could solve the flow control problem as well.
5.2 Data Aggregation
Figure 5.2 shows the basic idea of our proposed method. The control system is responsible for Algorithm 1 and each node is responsible for Algorithm 2. The
Algorithm 1 Algorithm for the central control system
 1: // Initialization. inject[src][dest] is a control message deciding whether to inject.
 2: for all source-to-destination transmission pairs do
 3:   inject[src][dest] = true
 4: end for
 5: for all transmissions in the delay table do
 6:   select the transmission Tdelay i,j with the highest priority
 7:   let path be the routing path of Tdelay i,j
 8:   if inject[i][j] == true then
 9:     for all link ∈ path do
10:       if link.workload > threshold then
11:         inject[i][j] = false
12:         break
13:       end if
14:     end for
15:     // send the control message to the nodes
16:     if inject[i][j] == true then
17:       send an injection notification to node i to inject Tdelay i,j
18:       update link.workload along path
19:       delete Tdelay i,j from the delay table
20:     end if
21:   end if
22: end for
23: update the priorities of the transmissions remaining in the delay table
24: collect predicted transmissions from the application-driven predictor
25: for all predicted transmissions do
26:   select the transmission Tpredict i,j with the highest priority
27:   let path be the routing path of Tpredict i,j
28:   if inject[i][j] == true then
29:     for all link ∈ path do
30:       if link.workload > threshold then
31:         inject[i][j] = false
32:         break
33:       end if
34:     end for
35:     if inject[i][j] == true then
36:       update link.workload along path
37:       delete Tpredict i,j from the predicted transmissions
38:     end if
39:   end if
40: end for
Algorithm 2 Algorithm for each node i
1: Receive the control message;
2: if there is a transmission to destination j then
3:     if inject[i][j] == true then
4:         Inject the transmission;
5:     else
6:         Add the transmission to the centralized delay table;
7:     end if
8: end if
9: Update the application-driven predictor;
10: Update link.workload;
control system sends control signals to each node via the control network, and
each node sends status information to the control system via the control
network to help it make decisions. The nodes communicate with one another
through the data network. In [10], the authors argue that the operating system
is capable of managing network traffic; hence, our method can be adopted on the
architecture platform of [10], with the control system realized by the
operating system. However, that approach may be too heavyweight, so we propose
an alternative: since many cores are available on chip, we can dedicate one
core to making the flow control decisions. This dedicated core plays the role
of the control system in Figure 5.2.
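As a concrete illustration, the admission check at the heart of Algorithm 1 can be sketched as follows. This is a simplified sketch, not the simulator implementation: the threshold of 40 flits is taken from Chapter 6, while the unit-cost workload update and the details of the x-y routing helper are illustrative assumptions.

```python
# Sketch of the central controller's admission check (Algorithm 1): a
# transmission (src, dest) is admitted only if every link on its x-y routing
# path is below the congestion threshold; otherwise it goes to the delay table.
THRESHOLD = 40  # flits; congestion threshold used in Chapter 6

def xy_path(src, dest):
    """Links visited by dimension-ordered (x-y) routing on a mesh.
    Nodes are (x, y) tuples; a link is an ordered (node, node) pair."""
    links, (x, y) = [], src
    while x != dest[0]:                      # route along the x dimension first
        nx = x + (1 if dest[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dest[1]:                      # then along the y dimension
        ny = y + (1 if dest[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def admit(src, dest, workload):
    """Return True and charge the path's links if no link is congested."""
    path = xy_path(src, dest)
    if any(workload.get(link, 0) > THRESHOLD for link in path):
        return False                         # inject[i][j] = false: delay it
    for link in path:                        # accepted: update link workloads
        workload[link] = workload.get(link, 0) + 1
    return True
```

For example, with an empty workload table, `admit((0, 0), (2, 1), {})` admits the transmission and records its three links; once any link on a path exceeds the threshold, later transmissions crossing that link are held back.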
5.3 Area Occupancy
Next, we analyze the area overhead of the NTPT. In this subsection, we use the
transistor counts of real manycore designs: UC Davis's AsAP has 55M
transistors, and Tilera's TILE64 has 615M transistors. Assuming
Figure 5.2: The diagram of flow control
that each bit of storage needs 6 transistors, the application-driven predictor
in our design needs 0.69M transistors when the number of cores is 64. We also
maintain another table, called the control table, to record the delayed
transmissions; assuming 128 entries, it needs about 0.02M transistors. Together
these tables occupy 1.29% and 0.12% of the transistor budgets of AsAP and
TILE64, respectively, which is a small and tolerable area overhead.
In contrast, [21] reports that increasing the data path width by 138% results
in an area penalty of 64% in Xpipes, a NoC architecture; this area overhead is
considerable. There, the average packet latency only drops from 49 cycles to
39 cycles as the link bandwidth grows from 2.2 GB/s to 3.2 GB/s. In short,
enlarging the link bandwidth improves the average packet latency only slightly
while incurring a huge area overhead. This observation motivates our
injection-rate flow control, since increasing the link bandwidth is not
economical.
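The overhead percentages above follow from simple arithmetic; a small sketch (transistor counts taken from the text) reproduces them:

```python
# Area-overhead arithmetic: predictor plus 128-entry delay (control) table,
# compared against the transistor budgets of AsAP (55M) and TILE64 (615M).
predictor = 0.69e6    # application-driven predictor, 64 cores, 6 transistors/bit
delay_table = 0.02e6  # 128-entry control table for delayed transmissions
total = predictor + delay_table

for chip, budget in [("AsAP", 55e6), ("TILE64", 615e6)]:
    print(f"{chip}: {100 * total / budget:.2f}% overhead")
# prints "AsAP: 1.29% overhead" and "TILE64: 0.12% overhead"
```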
Chapter 6
Experimental Results
In this chapter, we present experimental results to evaluate our proposed flow
control algorithm. We adopt both real application traffic and synthetic
traffic in our experiments.
6.1 Simulation Setup
The PoPNet network simulator [22] is used for our simulations, with data
transmission traces as input. Each trace record contains the packet injection
time, the source router address, the destination router address, and the
packet size. The detailed simulation configuration is provided in Table 6.1.
The original data transmission traces are altered by our flow control
algorithm, so that some transmissions are delayed for a period of time to
avoid congestion. The experimental results presented below show that our
algorithm yields a substantial performance improvement.
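To make the trace format concrete, one record can be modeled and altered as below. The whitespace-separated field layout and names are assumptions for illustration; PoPNet's actual on-disk format may differ.

```python
from typing import NamedTuple

class Transmission(NamedTuple):
    """One record of a data transmission trace (Section 6.1)."""
    inject_time: int  # packet injection time (cycles)
    src_x: int        # source router address
    src_y: int
    dest_x: int       # destination router address
    dest_y: int
    size: int         # packet size

def parse_trace_line(line):
    """Parse one whitespace-separated trace record (layout assumed)."""
    return Transmission(*(int(f) for f in line.split()))

def delay_transmission(t, delta):
    """Flow control alters a trace by postponing an injection by delta cycles."""
    return t._replace(inject_time=t.inject_time + delta)
```

For instance, `parse_trace_line("100 0 0 3 2 5")` yields a packet injected at cycle 100 from router (0, 0) to router (3, 2); delaying it by 50 cycles changes only its injection time, which is exactly how our algorithm rewrites the traces.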
Table 6.1: Simulation Configuration

Network Topology     4x4 mesh
Virtual Channels     3
Buffer Size          12
Routing Algorithm    x-y routing
Bandwidth            32 bytes
6.2 Real Application Traffic
Tilera's TILE64 platform is used to run the benchmark programs and collect the
data transmission traces. We use the SPLASH-2 blocked LU decomposition as our
benchmark program. The total workload is 3991 packets. As shown in Table 6.2,
the average packet latency drops from 2410.79 cycles to 771.858 cycles, and
the maximum packet latency drops from 5332 cycles to 3242 cycles. This
significant improvement comes from predicting the traffic workload in the next
interval and delaying some packet injections to avoid congestion. As depicted
in Figure 6.1 (a), the packet latencies without flow control range between 0
and 5500 cycles; with our proposed flow control algorithm, they range between
0 and 3300 cycles. The latencies decrease so sharply that the histogram shifts
to the left. To further support this claim, Figure 6.2 gives more details about
                     Original         Pattern-oriented   Reduction
Ave. latency         2410.79 cycles   771.858 cycles     3.12
Max. latency         5332 cycles      3242 cycles        1.64
Simulation Cycle     5600 cycles      6100 cycles        0.92

Table 6.2: Our proposed flow control algorithm leads to a huge reduction in the latency with a slight execution time overhead.
the network congestion. We set the congestion threshold to 40 flits. The line
in Figure 6.2 (b) occasionally rises above the threshold because of traffic
mispredictions; however, the impact of misprediction is slight, so the result
remains within an acceptable range. In Figure 6.2 (a), without flow control,
the maximum workload is far above the threshold, which causes severe network
congestion.
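The reduction factors reported in Table 6.2 are simply the original-to-improved ratios; a quick sketch confirms them:

```python
# Reduction factors from Table 6.2: original value / improved value.
rows = {
    "Ave. latency":     (2410.79, 771.858),
    "Max. latency":     (5332, 3242),
    "Simulation Cycle": (5600, 6100),
}
for name, (original, improved) in rows.items():
    print(f"{name}: {original / improved:.2f}x")
# prints 3.12x, 1.64x, and 0.92x respectively
```

A ratio below 1 for the simulation cycle reflects the slight execution time overhead that delayed injections introduce.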
6.3 Synthetic Traffic
Besides the real application traffic, we also apply our algorithm to synthetic
traffic. In [20], the authors show that injected network traffic possesses
self-similar temporal properties. They use a single parameter, the Hurst
exponent H, to capture the temporal burstiness of NoC traffic. Based on this
traffic model, we synthesize our traffic traces. In Table 6.3, we give some instances
Figure 6.1: Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically.
Figure 6.2: The maximum workload of links in the network without (a) and with (b) the proposed flow control.
of different H values and compare them. These values are chosen based on [20]:
Table 1 in [20] lists several Hurst exponents, and we pick a few of them for
convenience. The average packet latency and the maximum latency both drop
significantly. Moreover, the execution time with our proposed flow control is
slightly shorter than without it. Relatively large H values indicate highly
self-similar traffic and a higher traffic prediction accuracy. However,
because the average packet size also increases with H, the reduction does not
grow linearly with H.
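As a rough illustration of how such bursty traces can be synthesized (this is our own sketch, not necessarily the generator used in [20]): aggregating ON/OFF sources whose period lengths are Pareto-distributed with shape alpha is a standard way to obtain self-similar traffic with H = (3 - alpha) / 2, so a target H in (0.5, 1) maps to alpha = 3 - 2H.

```python
import random

def pareto_on_off_trace(h, n_periods=1000, seed=0):
    """Generate (start, end) injection bursts whose burstiness is governed
    by the Hurst exponent h (0.5 < h < 1). Pareto-distributed ON/OFF period
    lengths with shape alpha = 3 - 2*h yield self-similar aggregate traffic."""
    alpha = 3 - 2 * h
    rng = random.Random(seed)
    t, periods = 0.0, []
    for _ in range(n_periods):
        on = rng.paretovariate(alpha)    # burst: inject during [t, t + on)
        off = rng.paretovariate(alpha)   # idle gap before the next burst
        periods.append((t, t + on))
        t += on + off
    return periods
```

Larger h gives a smaller alpha and hence a heavier-tailed period distribution, i.e. longer bursts and idle gaps, which matches the highly self-similar traffic described above.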
H                                 0.576     0.661     0.768     0.855     0.978
Original Ave. latency (cycles)    3553.14   3596.45   3649.21   3665.53   3614.56
Improved Ave. latency (cycles)    482.512   467.787   387.716   412.983   417.577
Reduction of Ave. latency         7.364     7.688     9.412     8.876     8.656
Original Max. latency (cycles)    7623      7623      7710      7658      7714
Improved Max. latency (cycles)    1591      1532      1016      1054      1037
Reduction of Max. latency         4.791     4.976     7.589     7.266     7.438
Original Simulation Cycles        8580      8510      8550      8480      8450
Improved Simulation Cycles        8280      8260      7690      7781      7731

Table 6.3: Our proposed flow control algorithm for synthetic traffic leads to a huge reduction in the average and maximum latencies and a slight reduction in the execution time.
Chapter 7
Conclusion and Future Work
This thesis proposes an application-oriented flow control scheme for
packet-switched networks-on-chip. By tracking and predicting the end-to-end
transmission behavior of the running applications, we can limit traffic
injection when the network is heavily loaded. By delaying some transmissions
judiciously, the average packet latency is decreased significantly, and the
performance improves noticeably. In our experiments, we adopt both real
application traffic traces and synthetic traffic traces. The results show
that our proposed flow control not only decreases the average and maximum
packet latencies but, under some conditions, even shortens the execution time.
Future work will focus on improving the accuracy of the application-oriented
traffic prediction. The simulation configuration should also be explored
further; determining the optimal parameters and tuning the flow control
algorithm are important as well. In addition, we currently ignore the
communication dependencies within the traffic traces because of the difficulty
of modeling them.
Bibliography
[1] Y. S.-C. Huang, C.-K. Chou, C.-T. King, and S.-Y. Tseng, “NTPT: On the
end-to-end traffic prediction in the on-chip networks”, in Proc. 47th ACM/
IEEE Design Automation Conference, 2010.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay,
M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey,
D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montene-
gro, J. Stickney, and J. Zook, “TILE64 processor: A 64-core SoC with mesh
interconnect”, in Proc. Digest of Technical Papers. IEEE International
Solid-State Circuits Conference ISSCC 2008, Feb. 3–7, 2008, pp. 88–598.
[3] Jose Duato, Sudhakar Yalamanchili, and Lionel Ni, “Interconnection net-
works”, 2002, pp. 428–431.
[4] S. Mascolo, “Classical control theory for congestion avoidance in high-speed
internet”, in Proc. Decision and Control Conference, 1999.
[5] Cui-Qing Yang, “A taxonomy for congestion control algorithms in packet
switching networks”, in IEEE Network, 1995.
[6] Hua Yongru Gu, Wang Hua O., and Hong Yiguang, “A predictive conges-
tion control algorithm for high speed communication networks”, in Proc.
American Control Conference, 2001.
[7] Erland Nillson, Mikael Millberg, Johnny Oberg, and Axel Jantsch, “Load
distribution with the proximity congestion awareness in a network on chip”,
in Proc. Design, Automation, and Test in Europe, 2003, p. 11126.
[8] U. Y. Ogras and R. Marculescu, “Prediction-based flow control for network-
on-chip traffic”, in Proc. 43rd ACM IEEE Design Automation Conference,
2006, pp. 839–844.
[9] U. Y. Ogras and R. Marculescu, “Analysis and optimization of prediction-
based flow control in networks-on-chip”, in ACM Transactions on Design
Automation of Electronic Systems, 2008.
[10] Vincent Nollet, Theodore. Marescaux, and Diederik Verkest, “Operating-
system controlled network on chip”, in Proc. 41st ACM/IEEE Design
Automation Conference, 2004.
[11] P. Avasare, J-Y. Nollet, D. Verkest, and H. Corporaal, “Centralized end-
to-end flow control in a best-effort network-on-chip”, in Proc. 5th ACM
international conference on Embedded Software, 2005.
[12] Mohammad S. Talebi, Fahimeh Jafari, and Ahmad Khonsari, “A novel
congestion control scheme for elastic flows in network-on-chip based on sum-
rate optimization”, in ICCSA, 2007.
[13] M. S. Talebi, F. Jafari, and A. Khonsari, “A novel flow control scheme
for best effort traffic in noc based on source rate utility maximization”, in
MASCOTS, 2007.
[14] Mohammad S. Talebi, Fahimeh Jafari, Ahmad Khonsari, and Mohammad H.
Yaghmaeem, “Best effort flow control in network-on-chip”, in CSICC, 2008.
[15] Fahimeh Jafari, Mohammad S. Talebi, Mohammad H. Yaghmaee, Ahmad
Khonsari, and Mohamed Ould-Khaoua, “Throughput-fairness tradeoff in
best effort flow control for on-chip architectures”, in Proc. 2009 IEEE
International Symposium on Parallel and Distributed Processing, 2009.
[16] T. Marescaux, A. Rangevall, V. Nollet, A. Bartic, and H. Corporaal, “Dis-
tributed congestion control for packet switched networks on chip”, in
ParCo, 2005.
[17] J.W. van den Brand, C. Ciordas, K. Goossens, and T. Basten, “Congestion-
controlled best-effort communication for networks-on-chip”, in Proc. De-
sign, Automation, and Test in Europe, 2007.
[18] Jin Yuho, Yum Ki Hwan, and Kim Eun Jung, “Adaptive data compression
for high-performance low-power on-chip networks”, in Proc. 41st annual
IEEE/ACM International Symposium on Microarchitecture, 2008.
[19] Keshav Srinivasan, “Congestion control in computer networks”, 1991.
[20] Vassos Soteriou, Hangsheng Wang, and Li-Shiuan Peh, “A statistical traffic
model for on-chip interconnection networks”, in Proc. 14th IEEE Interna-
tional Symposium on Modeling, Analysis, and Simulation, 2006.
[21] Anthony Leroy, “Optimizing the on-chip communication architecture of low
power systems-on-chip in deep sub-micron technology”, 2006.
[22] N. Agarwal, T. Krishna, L. Peh, and N. Jha, “Garnet: A detailed on-chip
network model inside a full-system simulator”, in Proceedings of Inter-
national Symposium on Performance Analysis of Systems and Software,
2009.