Parallelized and Efficient
Implementations of Rate Dematching and Channel Deinterleaver
on a Symmetrical Multiprocessor Architecture
ARASH BAZRAFSHAN
Master of Science Thesis Stockholm, Sweden 2010
Master's Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2010
Supervisor at CSC was Stefan Nilsson
Examiner was Stefan Arnborg
TRITA-CSC-E 2010:169
ISRN-KTH/CSC/E--10/169--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.kth.se/csc
Praise be to God, Master of the Universe
Parallelized and Efficient Implementations of Rate Dematching and Channel Deinterleaver on a Symmetrical Multiprocessor Architecture
Abstract
In this master's thesis, parallelized implementations of Rate dematching and Channel
deinterleaver have been written for the symmetrical multiprocessor (SMP) architecture
used on Ericsson AB's base stations, which are designed for LTE, the new standard in
mobile communication. This report reviews the capabilities of the SMP architecture and
states how critical loops of a common kind should be written for the architecture so that
their execution times cannot be reduced further and thus are optimal. The report then
describes how the parallelized implementations work and how they have been
optimized, and it suggests how they could be optimized further to reduce execution
time. Some hardware changes to the SMP architecture are also proposed for
consideration. The report also reviews the efficiency of the assembler code produced by
the compiler used to compile implementations that are to run on the SMP architecture.
The conclusion is that the compiler uses the SMP architecture's capabilities
inadequately.
Parallelized and Efficient Implementations of Rate Dematching and Channel Deinterleaver for a Symmetrical Multiprocessor Architecture
Summary
In this degree project, parallelized implementations of Rate dematching and Channel
deinterleaver have been written for the symmetrical multiprocessor (SMP) architecture
used on the base stations that Ericsson AB has developed for LTE, the new standard in
mobile communication. This report examines the SMP architecture's capabilities and
explains how critical loops of a commonly occurring kind should be written for the
architecture so that their execution time cannot be reduced further; the loops are thereby
optimal. The report then explains how the parallelized implementations work and how
they have been optimized, and it also suggests how they can be optimized further.
Proposals for hardware changes to the SMP architecture are also given. The report also
examines the efficiency of the assembler code produced by the compiler used to
compile implementations that are to be executed on the SMP architecture. The
conclusion is that the compiler uses the SMP architecture's capabilities inadequately.
TABLE OF CONTENTS
1 Introduction ................................................................................................................... 1
1.1 Background information ......................................................................................... 1
1.1.1 Network of base stations.................................................................................. 1
1.1.2 Mobile communication standards and 3GPP................................................... 2
1.1.3 LTE and Ericsson ............................................................................................ 2
1.1.4 Processing steps performed on the mobile device ........................................... 2
1.1.5 Processing steps performed on the base station. ............................................. 4
1.2 Specification of thesis ............................................................................................. 5
1.3 Reading guidelines ................................................................................................. 5
1.3.1 Prior knowledge required by the reader .......................................................... 6
1.3.2 Sensitive information regarding EMPA .......................................................... 6
1.3.3 The report’s content ......................................................................................... 8
1.3.3.1 Chapters 1 and 2 ....................................................................................... 8
1.3.3.2 Chapters 3 and 4 ....................................................................................... 8
1.3.3.3 Chapter 5 .................................................................................................. 8
1.3.3.4 Chapters 6, 7 and 8 ................................................................................... 8
1.3.3.5 Appendices ............................................................................................... 9
1.3.4 Mathematical notations used ........................................................................... 9
1.3.4.1 Constants and variables ............................................................................ 9
1.3.4.2 Arrays ....................................................................................................... 9
1.3.4.3 Matrices .................................................................................................. 10
1.3.5 Abbreviations ................................................................................................ 11
1.3.6 Expressions .................................................................................................... 11
2 EMPA .......................................................................................................................... 16
2.1 The Digital Signal Processors .............................................................................. 16
2.1.1 The DSP’s LDM and LPM ............................................................................ 17
2.1.2 The DSP’s registers ....................................................................................... 18
2.1.2.1 Temporary registers ................................................................................ 18
2.1.2.2 Address registers .................................................................................... 18
2.1.2.3 Offset registers ........................................................................................ 18
2.1.3 Instruction execution on the DSP .................................................................. 18
2.1.3.1 The DSP’s instruction pipeline ............................................................... 18
2.1.3.2 Executing parallel instructions on the DSP ............................................ 19
2.1.3.3 LDM instructions and their capabilities ................................................. 20
2.1.4 Hardware support for executing loops on the DSP ....................................... 21
2.2 The Common Memory ......................................................................................... 21
2.3 The Command Bus ............................................................................................... 22
3 Execution methodologies used on EMPA ................................................................... 23
3.1 Locks .................................................................................................................... 23
3.2 To dispatch a JOB ................................................................................................ 23
3.3 Memory allocation ................................................................................................ 27
3.4 Barrier synchronization between the job DSPs .................................................... 27
4 Definition of an optimal loop ...................................................................................... 28
5 Specification of the processing steps ........................................................................... 32
5.1 Overview of the Uplink Transport Channel ......................................................... 32
5.2 Channel deinterleaver ........................................................................................... 33
5.2.1 Specification of Channel interleaver ............................................................. 33
5.2.2 Specification of Channel deinterleaver.......................................................... 36
5.3 Rate dematching ................................................................................................... 38
5.3.1 Specification of Rate matching...................................................................... 38
5.3.1.1 Padding ................................................................................................... 40
5.3.1.2 Permutation I .......................................................................................... 41
5.3.1.3 Permutation II ......................................................... 42
5.3.1.4 Bit collection .......................................................................................... 43
5.3.1.5 Bit selection ............................................................................................ 44
5.3.2 Specification of Rate dematching .................................................................. 44
6 Processing algorithm for Channel deinterleaver ......................................................... 47
6.1 Initial discussion ................................................................................................... 47
6.1.1 Regarding existing literature on matrix transposition ................................... 47
6.1.2 The q symbols ................................................................................................ 48
6.2 Suggestion for processing algorithm .................................................................... 48
6.2.1 How to parallelize .......................................................................................... 48
6.2.2 Clearing the q symbols .................................................................................. 56
6.3 Description of the processing algorithm ............................................................... 57
7 Processing algorithm for Rate dematching .................................................................. 60
7.1 Initial discussion ................................................................................................... 60
7.1.1 Redefinition of Permutation II ....................................................................... 60
7.1.2 Bit collection seen from a new perspective ................................................... 63
7.1.3 Bit selection seen from a new perspective..................................................... 63
7.1.4 Rate dematching seen from a new perspective .............................. 67
7.1.5 Remarks regarding SatFunc and soft combining ........................................... 70
7.1.6 Typical input parameters for Rate dematching.............................. 70
7.2 Suggestion for processing algorithm .................................................................... 70
7.2.1 The benefit of clearing U, V and W .............................................................. 70
7.2.2 Efficient soft-combining of multiple matrix repetitions ................................ 71
7.2.3 Working with bytes in words......................................................................... 71
7.2.4 How to parallelize .......................................................................................... 71
7.3 Description of the processing algorithm ............................................................... 75
8 Conclusion ................................................................................................................... 76
8.1 Implementation of Channel deinterleaver ............................................................ 76
8.2 Implementation of Rate dematching..................................................................... 77
8.3 The implementations’ correctness ........................................................................ 78
References ...................................................................................................................... 79
APPENDICES
Appendices ..................................................................................................................... 80
Appendix 1. Tools used to write implementations ......................................................... 81
Appendix 2. Channel deinterleaver ................................................................................ 82
2.1. Specification of the processing algorithm ........................................................... 82
2.1.1. Specification of deint_master ....................................................................... 82
2.1.2. Specification of deint_slave.......................................................................... 83
2.1.3. Specification of deint_section ...................................................................... 85
2.1.4. Specification of deint_segment .................................................................... 86
2.1.5. Specification of deint_column ...................................................................... 87
2.2. Test cases used for writing implementation ........................................................ 88
2.2.1. Test cases for testing correctness.................................................................. 88
2.2.2. Test cases for measuring performance ......................................................... 90
2.2.3. Test cases for measuring performance during optimization ......................... 91
2.3. Implementation of the processing algorithm ....................................................... 91
2.3.1. Performance of the first version of the implementation ............................... 92
2.3.2. Major optimizations performed in C code .................................................... 93
2.3.2.1. Using specific symbol sizes in deint_column ....................................... 93
2.3.2.2. Unrolling deint_column’s loop .............................................................. 94
2.3.2.3. Reading and writing more words per LDM instruction ........................ 96
2.3.2.4. Clearing the q symbols .......................................................................... 96
2.3.3. Performance of the implementation after C optimizations ........................... 98
2.3.4. Optimizations performed in assembler code .............................................. 100
2.3.4.1. Initial changes before optimization ..................................................... 100
2.3.4.2. Description of the assembler optimizations ........................................ 102
2.3.5. Performance of the implementation after assembler optimizations ........... 103
2.4. Future improvements ......................................................................................... 103
2.4.1. Changes of convenience ............................................................................. 103
2.4.2. Using hardware support to execute loops ................................................... 104
2.4.3. Making the critical loop of deint_column_4 optimal ................................. 104
2.4.4. Making the critical loops of deint_column_2 optimal ............................... 106
2.4.5. Making the critical loops of deint_column_6 optimal ............................... 108
Appendix 3. Rate dematching ...................................................................................... 109
3.1. Specification of the processing algorithm ......................................................... 109
3.1.1. Description of byte_array ........................................................................... 109
3.1.2. Description of rd_data ................................................................................ 109
3.1.3. Description of input_reader ........................................................................ 110
3.1.4. Description of column_traverser ................................................................ 110
3.1.5. Specification of dematch ............................................................................ 110
3.1.6. Specification of soft_comb_array ............................................................... 111
3.2. Test cases used for writing implementation ...................................................... 113
3.2.1. Test cases for testing correctness................................................................ 113
3.2.2. Test cases for measuring performance ....................................................... 113
3.3. Implementation of the processing algorithm ..................................................... 114
3.3.1. Major optimization steps performed .......................................................... 115
3.3.2. Description of assembler optimizations ..................................................... 117
3.3.3. The implementation’s performance after optimizations ............................. 118
3.4. Future improvements ......................................................................................... 120
3.4.1. Bypassing the structures ............................................................................. 120
3.4.2. Processing consecutive columns to obtain optimality ................................ 120
Appendix 4. Review of assembler code produced by EMC ......................................... 123
4.1. First critical loop ................................................................................................ 123
4.2. Second critical loop ........................................................................................... 125
4.3. Conclusion ......................................................................................................... 126
Appendix 5. Hardware changes to EMPA.................................................................... 127
5.1. Swapping contents of temporary register parts ................................................. 127
1 Introduction

This report presents the results of the thesis and explains how they were achieved. The
thesis' purpose must be stated first, which is done in this chapter. The thesis is
concerned with mobile communication, and so, to enable the reader to understand its
purpose, some background information is presented first in this chapter.
The chapter also helps the reader to understand the report. It provides reading
guidelines and presents two lists of special abbreviations and expressions used in the
report. Special mathematical notations used in the report are also explained.
1.1 Background information

The following subparagraphs give a brief introduction to mobile communication in
order to enable the reader to understand the thesis' purpose. Technical details that are
not directly relevant are avoided, and interested readers are referred to the
subparagraphs' sources, namely [4], [8] and [1]. On the other hand, some aspects are
relevant and need to be elaborated on further than is done here. This is covered by
Chapter 5.
1.1.1 Network of base stations
The underlying system that enables users to communicate via mobile phones and other
mobile devices includes a network of base stations, see Figure 1. When a mobile device
is turned on it registers itself with a base station that is sufficiently close to communicate
with. That base station is now responsible for relaying data to and from the device.
If mobile device A is to transmit data to device B, then the former first sends the data to
the base station that it has registered with. Device B may also have registered with the
same station, in which case the station transmits the data to B. If B on the other hand has
registered with another station then the data is first transmitted to that station, which in
turn relays it to B.
Of course mobile devices may be physically moving during communication. The device
always keeps track of what base stations are in its proximity and registers with a new one
if it finds it more suitable.1
1 This process is known as a handover.
Figure 1: A description of how a network of base stations enables mobile communication. The mobile devices A
and B have registered with one base station, while C has registered with another. Device A wishes to send data to
B and C, respectively. It is shown how data reaches its destination through the base stations.
1.1.2 Mobile communication standards and 3GPP
To enable mobile communication, a standard is required that enforces protocols on how
data is to be transmitted. The most widely used standard is the Global System for
Mobile Communications (GSM). Another standard, known as 3G, was originally
developed independently in several countries, such as Japan, the US and South Korea,
each using a different variant of the WCDMA technology. Several standard-developing
organizations from different countries were keeping WCDMA standardized in parallel.
This ended when the Third Generation Partnership Project (3GPP) was formed by many
such organizations from all regions of the world. Its initial objective was to maintain and
develop specifications for 3G mobile technology based on WCDMA. Its scope was later
extended to include maintenance of the GSM standard.
1.1.3 LTE and Ericsson
Long Term Evolution (LTE) is the successor of 3G and is currently being specified by
3GPP. Ericsson AB is a provider of telecommunication equipment and is currently
developing base stations capable of utilizing LTE. Throughout this report, the division
of Ericsson AB that works with LTE is referred to as ELTE. This is the division where
the thesis was performed.
1.1.4 Processing steps performed on the mobile device
Throughout this report, the expression processing step denotes what must be done to
some input to produce a desired output. The expression processing algorithm refers to
how a processing step may be performed (only the word "step" or "algorithm" is used
for short when it is obvious what is referred to). An implementation is, as usual, a
well-defined set of instructions that a processor can execute to carry out an algorithm.
An implementation may use multiple processors simultaneously to execute its work, in
which case it is a parallelized implementation. To clarify with an example, a processing
step may require that some elements are ordered in ascending order. The processing
algorithm can be any sorting algorithm, such as merge sort. An implementation of
merge sort can then be used to perform the processing step.
LTE specifies that the data must be processed in a certain way before it is transmitted
from a mobile device to a base station. A certain set of processing steps is performed in
a specific order on the data: the data is the input to the first step, whose output is the
input to the next step, and so on. See the upper half of Figure 2. This chain of steps is
called the Uplink Transport Channel (ULTC).2 The output of the final step is transmitted
to the base station.
2 Downlink Transport Channel is the chain of processing steps that must be applied before the data is to be
transmitted from the base station to the mobile device, but this is not relevant for the thesis.
[Figure 2 diagram: on the mobile device (ULTC), the original data passes through Processing step 1, Processing step 2, ..., Rate matching, Channel interleaver, ..., Processing step N-1, Processing step N, after which it is ready for transmission to the base station. On the base station, the data received over the transmission passes through Processing step 1, Processing step 2, ..., Channel deinterleaver, Rate dematching, ..., Processing step M-1, Processing step M to recover the original data.]
Figure 2: A comparison between the processing steps performed on the mobile device and the base station. The
names of four of the steps that are relevant to the thesis are shown. Dashed lines mean there are more steps
between. ULTC’s processing steps are performed on the mobile device. Another set of processing steps must be
applied to the received transmission on the base station to obtain the original data.
1.1.5 Processing steps performed on the base station
When ULTC’s processing steps have been applied to the original data, the output of the
last step is transmitted from the mobile device to the base station. The latter must obtain
the original data that was the first input of ULTC’s steps. This is achieved on the base
station by applying another set of processing steps in a specific order to the received
transmission. The lower half of Figure 2 clarifies this.
Some of these steps have been implemented as dedicated hardware components on
ELTE’s base station. The implementations of other steps are executed on
microprocessors. For the latter case, a symmetrical multiprocessor architecture is
provided on the base station. This specific multiprocessor architecture will throughout
this report be referred to as ELTE’s Multiprocessor Architecture (EMPA). EMPA can
execute parallelized implementations. A special compiler is used to compile
implementations that have been written in C code and are to run on EMPA. The compiler
will be referred to as the EMPA Compiler (EMC).
1.2 Specification of thesis

The following two processing steps must be implemented on EMPA to process received
data from mobile devices (Figure 2 shows them):
- Channel deinterleaver
- Rate dematching
The three main purposes of this thesis are to
- write one parallelized implementation for each of the processing steps Channel
  deinterleaver and Rate dematching,
- improve the implementations' memory usage and execution time as much as the
  duration of the thesis permits, and
- suggest how they can be further improved.
During the work, the author decided to extend the thesis' scope by
- suggesting hardware changes to EMPA that may make it easier to write efficient
  implementations in the future, and
- reviewing the efficiency of the assembler code produced by EMC.
1.3 Reading guidelines

The purpose of this report is not only to present the results of the thesis, but also to give
a good understanding of how those results were achieved. This requires a thorough
description of aspects of the thesis such as
- the hardware and the other tools that have been used,
- the specifications of the processing steps that the thesis is concerned with,
- how those specifications can be seen from a new perspective in order to write
  efficient processing algorithms,
- a definition of what characterizes an efficient implementation,
- the implementations that were written and how they have been improved,
- how the implementations can be further improved, and
- how the hardware and the tools that have been used can be further improved.
As the reader can see, a lot of ground needs to be covered. In addition, the author has
done his best to make it easy to understand the necessary information, arguments and
ideas presented throughout this report. Many figures are used, among other things, to
clarify concepts and ideas and to give illustrative examples of arguments and
descriptions. Many boxes of pseudocode are also used to further clarify nearly all the
algorithms that are described and discussed. The author hopes that the report's purpose
and these measures justify the large number of pages that the report requires, and it is
his sincere wish that its length does not deter the reader from reading it.
Having said that, some reading guidelines for this report will now be presented. The
following subparagraphs intend to inform the reader about what he can expect from this
report, and to help him to better understand its content.
1.3.1 Prior knowledge required by the reader
The thesis is concerned with computer science, so to understand this report the reader is
required to have some prior knowledge within certain fields. When it comes to
computer hardware, one should be familiar with basic concepts such as
- primary memory,
- secondary memory,
- data bus, and
- processor.

A deeper understanding of processors is required, but more of their functionality than
of their underlying architecture. Specifically, the reader should be familiar with key
words and phrases such as
- instruction set,
- instruction pipeline,
- clock frequency and cycles,
- memory accesses, and
- registers.
As has been stated, EMPA is a symmetrical multiprocessor architecture. It is believed
that this report describes such architectures in a manner that can serve as an
introduction to them if the reader is unfamiliar with them. However, the thesis is also
concerned with parallel programming, for which the reader should know of concepts
such as workload balancing and synchronization techniques such as locks and barrier
synchronization. One purpose of the thesis is to write parallelized implementations, and
the report will describe them extensively; for this the reader should have basic skills in
C and assembler programming. Finally, the reader should also know what a compiler is
and have a basic understanding of how it works.
1.3.2 Sensitive information regarding EMPA
As has been said, Ericsson AB is currently developing base stations capable of utilizing
LTE. It is of course the company's goal to convince customers that they benefit more
from investing in its base stations than in the competitors'. It is believed that certain
properties of EMPA's components will give Ericsson AB advantages over its
competitors in the market. Such properties may not be publicly revealed, but some of
them are relevant to the thesis, and the report describing the thesis, namely this one,
must be made publicly available. This paragraph describes exactly how such properties
have been concealed and how this affects the report's content. The properties fall into
two broad categories. The first is numbers that in some way describe EMPA's
components, for instance a memory component's latency or size. The second is
properties that, simply put, are more than numbers; they describe for instance the layout
of one of EMPA's components or how access to it is limited in some way.
Properties falling into the first category are simply replaced by named constants. The
constants can then be used in, for instance, formulas or when describing problems that
were encountered in the thesis. What such a constant's value represents is of course
explained the first time it is used. All of the constants are also presented in the table in
this paragraph.

Properties falling into the second category simply cannot be discussed in any way. In
fact, the reader may not even be given a hint as to whether some component of EMPA
has such a property. It has still been the author's goal to make sure that the reader's
understanding of the report is not too severely limited by this.
Table 1: The table shows constants that are used in the report to conceal some sensitive values regarding
EMPA. The meaning and unit of each constant is presented.

Constant name     Meaning                                                        Unit
avail_LDM         The amount of a Digital Signal Processor's Local Data          kilobytes
                  Memory that can be used by an implementation.
lat_CM_read       The minimum time it takes to initiate a Common Memory          cycles
                  instruction that reads from the Common Memory.
lat_CM_write      The minimum time it takes to initiate a Common Memory          cycles
                  instruction that writes to the Common Memory.
speed_CM          The maximum amount of data that can be read from or            words
                  written to the Common Memory by a Common Memory
                  instruction once the instruction has been initiated.
size_CM           The size of the Common Memory.                                 megabytes
nr_PAR_INSTR      The maximum number of instructions that a Digital Signal       instructions
                  Processor can execute in parallel.
nr_PAR_WITH_LDM   The maximum number of instructions that a Digital Signal       instructions
                  Processor can execute in parallel with two Local Data
                  Memory instructions.
bits_MSG_PAYLOAD  The size of the payload of messages that are transferred       bits
                  over the Command Bus.
cycles_MSG_TIME   The amount of time from when a process sends a message         cycles
                  over the Command Bus until the receiving process obtains it.
1.3.3 The report’s content
The following subparagraphs give an overview of the report’s content. The purpose of
each chapter and how the chapter relates to other chapters are described.
1.3.3.1 Chapters 1 and 2
The rest of this chapter continues providing guidelines for reading this report. Next, the
aspects of EMPA that are relevant to the thesis are thoroughly described in Chapter 2. This is
done by describing EMPA component by component and explaining how the components are related
to one another. It is some of these components' properties that can not be publicly
revealed, due to the reasons stated in Paragraph 1.3.2.
1.3.3.2 Chapters 3 and 4
There are some rules and execution methodologies that an implementation that is to run
on EMPA must comply with. These must be understood before it is discussed how any
processing step can be implemented. Chapter 3 covers this topic.
Having covered the relevant aspects of EMPA, it would be good to know whether an
implementation utilizes its capabilities maximally and thus is optimal. But one can hardly
claim to have written an implementation that can not be further improved; even
judging this is highly subjective. However, Chapter 4 shows that this sort of judgment
can be cast on a certain type of loop that is common in implementations that are to run
on EMPA. The chapter defines optimality for such a loop by stating a criterion that the
loop must meet. This criterion is later used when discussing the critical loops of the
implementations written in the thesis. It is also used to suggest hardware changes to
EMPA that could make it easier to write optimal loops of this kind.
1.3.3.3 Chapter 5
The specification of the processing steps that are to be implemented in the thesis must
be thoroughly understood. This becomes easier if one also understands ULTC and some
of its processing steps that are applied on the mobile device (see Paragraph 1.1.4).
Therefore Chapter 5 describes ULTC in more detail and specifies all the processing steps
that are relevant to the thesis.
1.3.3.4 Chapters 6, 7 and 8
Chapters 6 and 7 are the heart of this report. Each of them is devoted to one of the
processing steps, and suggests a processing algorithm that is believed to be efficient. The
algorithms are specified in further detail, and their implementations discussed, in
Appendix 2 and Appendix 3. The appendices also discuss how the implementations'
performance and correctness were verified.
Chapter 8 concludes the report by summarizing its results.
1.3.3.5 Appendices
Appendix 1 describes the tools that were used to write the implementations. Appendix 2
and Appendix 3 then, among other things, describe the processing algorithms and how
they were implemented. Appendix 4 discusses the efficiency of the assembler code
produced by the compiler EMC. Specifically, it reviews how the compiler makes use of
EMPA's capabilities when it compiles some simple critical loops written in C code.
Appendix 5 suggests some hardware changes to EMPA that might make it easier to
write efficient implementations.
1.3.4 Mathematical notations used
Variables, constants, arrays and matrices will be extensively referred to throughout this
report. This paragraph clarifies exactly how they are used and their notations.
1.3.4.1 Constants and variables
A constant or variable is as always a representation of a value. This may be for instance
an integer or a real number. A constant that is introduced will represent the same value
through the rest of the report. For instance it may be used in an algorithm, or can
represent some characteristic of a hardware component. The value of a variable on the
other hand may vary in the context that it is used. For instance it can be the input
parameter of an algorithm or function. Some literature distinguishes between the
two by using capital letters when naming a constant. However, this is not the case in this
report.
A constant or variable can be given a name with multiple characters. Because of this,
when multiplication is performed in formulas, the character "*" will always be used to
avoid confusion. When a sentence is concerned with the constant or variable itself as a
noun, it simply uses its name. But sometimes it is the value that is of interest. To avoid
confusion in the latter case, the name is prefixed by "#". On a final note, a variable that is
a number can be cleared, meaning that its value is set to zero.
The following sentences give examples of these policies.3
nrCAT is an integer variable that denotes the number of cats, while weightCAT = 2.839 is a
constant that denotes every cat's weight in grams. So there are # nrCAT cats, and each one
of them weighs # weightCAT grams. Together they weigh # nrCAT * # weightCAT grams.
1.3.4.2 Arrays
An array is an ordered sequence of elements, where all elements are of the same type.
For instance there can be an array of integers or an array of characters, but there can not
be an array that contains both integers and characters. Also the elements can be either
variables or constants, but not both. The elements are indexed by integers from zero in
ascending order of appearance in the sequence. An array named A that has n elements is
denoted by A(0..n-1). The element of A that has index i, where 0 <= i < n, is denoted by A(i).
3 Apart from this example, cats have absolutely no relevance to the thesis.
A subarray of A is an array in its own right. It consists of (i) all of A's elements that are
between two specified elements and (ii) the two specified elements themselves. The subarray's
elements have the same order as they have in A. The subarray that spans from element
A(i) to A(k) is denoted by A(i..k) if it is important to stress that it is a subarray. In that
case the elements maintain the index values that they have in A. But if it is more
interesting to regard it as an array in its own right, then it can also be given a new name,
say B. In that case it is denoted by B(0..k-i) as any array would be. Now the elements are
given new index values.
One can clear an array that contains number variables, which means that every variable
is set to zero.
1.3.4.3 Matrices
A matrix is as always a rectangular grid of ordered horizontal rows and vertical
columns, and in each row and column there is an element. All the elements are of the
same type and are either variables or constants. The rows are indexed from zero in
ascending order of appearance, and so are the columns. The orientation of a matrix is
such that the topmost row has index zero, and the leftmost column has index zero. The
combination of any row's index coupled with any column's index effectively becomes
the index of the element that is located in that row and column. A matrix M with #r rows
and #c columns is denoted by M(0..r-1, 0..c-1). The element of row s and column t is
denoted by M(s,t).
Some task can be performed on every element of a matrix, such as reading them or
changing them in some way. When it is said that a task is performed on a matrix row by
row, it means that the task is performed on every element one row at a time starting from
the topmost row to the bottommost row. The task is performed on the elements of each
row from left to right. Likewise when a task is performed on a matrix column by column,
it means that the task is performed on every element one column at a time starting from
the leftmost column to the rightmost. The task is performed on the elements of each
column from top to bottom. For instance a matrix’s elements can be read row by row, or
the elements of an array can be placed into a matrix column by column.
A matrix M(0..r-1, 0..c-1) can also be regarded as an array. In that case the array
will always have the same name as the matrix. The matrix's r * c elements appear row
by row in the array, and so it is denoted by M(0..r*c-1). To clarify, element M(i) is
located in row i/c (integer division) and column i%c in the matrix.
A matrix that contains number variables can be cleared, which means that every
variable is set to zero.
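To make the row-by-row mapping concrete, the following C helpers compute the flat index of element M(s,t) and recover the row and column of element M(i). The function names are illustrative and do not come from the thesis code.

```c
/* Illustrative helpers for a matrix M(0..r-1, 0..c-1) stored row by
 * row as the array M(0..r*c-1). Names are hypothetical. */
static int flat_index(int row, int col, int c)
{
    return row * c + col;   /* index of M(row, col) in the flat array */
}

static int row_of(int i, int c)
{
    return i / c;           /* integer division */
}

static int col_of(int i, int c)
{
    return i % c;
}
```

For example, in a matrix with c = 5 columns, element M(13) sits in row 2 and column 3.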
1.3.5 Abbreviations
This report will make use of some abbreviations. What expression each of them
abbreviates will be explained the first time it is presented. They are also presented in the
following table for convenience. For each abbreviation the expression that it represents is
listed. The expressions are further explained in the following paragraph.
Table 2: List of abbreviations that are used in this report.
Abbreviation Expression
3GPP Third Generation Partnership Project
CM Common Memory
DSP Digital Signal Processor
ELTE Ericsson LTE
EMC EMPA Compiler
EMPA Ericsson Multiprocessor Architecture
EMS EMPA Simulator
LDM Local Data Memory
LPM Local Program Memory
LTE Long Term Evolution
CB Command Bus
ULTC Uplink Transport Channel
1.3.6 Expressions
The following table presents expressions used in this report and their meanings. For
each of them an explanation is presented, but the explanation might depend on other
expressions or abbreviations. These have been marked in italics and can be found in the
same table or in Table 2 that lists abbreviations.
Table 3: List of special words and expressions used in this report.
Word or expression Explanation
a0, a1, a2, … These denote the temporary registers of any DSP.
a0l, a0h, a1l, a1h, a2l,
a2h, …, …
These denote the temporary register parts of the temporary
registers.
address register A type of registers in a DSP. Each such register can contain an
LDM address. The registers are then used by LDM instructions.
base station A device that enables mobile communication within its
proximity.
Channel deinterleaver A processing step that is to execute on a base station that
implements the standard LTE.
Channel interleaver One of the processing steps of ULTC.
Clearing The act of setting the value of a variable to zero, or setting every
variable in a matrix or an array to zero.
CM instruction A DSP instruction that is executed to read an array of words
from CM to LDM, or to write in the opposite direction.
column buffer The expression is used when discussing the processing algorithm
and implementation of Channel deinterleaver. It denotes a part of
a job DSP’s LDM that the DSP uses to store one column of its
current segment. The DSP then processes that column by placing
it into the segment.
column repetition This expression is used in the context of Rate matching and Rate
dematching. In the former, it denotes a subarray of the resulting
bit array e that originates from one column of one of the matrices
X, Y or Z. In the latter, it denotes a subarray of the input byte
array E that is to be soft combined with one column of one of the
matrices U, V or W.
Command Bus A component of EMPA that enables communication between any
pair of DSPs via short messages.
criterion for optimal
loop performance
A criterion that defines optimality for a loop that is executed by a
DSP and that reads or writes data to LDM. See the end of
Chapter 4 for a precise definition.
critical loop A loop that contains no other loops and that an implementation
spends a large part of its execution time executing.
Digital Signal
Processor
A component of EMPA, of which there are multiple instances. It is
a microprocessor.
dispatch DSP A DSP that has been dedicated to dispatching JOBs.
DSP instruction Any instruction in the DSP’s instruction set.
ELTE Multiprocessor
Architecture
A symmetrical multiprocessor architecture that will be provided
on the base stations that Ericsson AB is developing for the
standard LTE.
EMPA Compiler A compiler used to compile implementations that are to run on
EMPA.
EMPA Simulator A simulator that simulates all the capabilities of EMPA on an
ordinary UNIX computer.
entry function The first function of a JOB to be invoked on any job DSP that
participates in executing that JOB.
Ericsson AB A company that among other things provides telecommunication
equipment.
Ericsson LTE A division of Ericsson AB that is developing base stations that
utilize the standard LTE.
JOB A job that is dispatched by a dispatch DSP to be executed by
some job DSPs on EMPA. It specifies the functions and variables
that are to be used by the job DSPs.
job DSP A DSP that has been dedicated to executing JOBs that are
dispatched by dispatch DSPs.
LDM instruction A DSP instruction that is executed by a DSP to read 8, 16 or 32
bits from its LDM to one of its temporary registers, or to write in
the opposite direction.
Local Data Memory The local memory of a DSP that is used to store data that the
DSP is to read or modify.
Local Program
Memory
A local memory of a DSP that is used to store instructions that
the DSP is to execute.
Long Term Evolution A standard in mobile communication that is being specified by
3GPP.
m0, m1, m2, … These denote the offset registers of a DSP.
memory instruction A microprocessor instruction that either reads or writes data to a
memory component.
move instruction A DSP instruction that copies the contents of one temporary
register part to another.
offset register A register that can be used by an LDM instruction to modify the
value of the address register that the instruction is using.
parallelized
implementation
An implementation that is executed in parallel on more than one
processor.
processing algorithm An algorithm that specifies how a processing step can be
performed.
processing step A definition of what must be done to some input in order to
produce a valid output for that input. The processing step
therefore also specifies what a valid input is.
r0, r1, r2, … These denote the memory registers of a DSP.
Rate dematching A processing step that is to execute on a Base station that
implements the standard LTE.
Rate matching One of the processing steps of ULTC.
repetition This word has a special meaning for the processing steps Rate
matching and Rate dematching. In the former, it denotes a
subarray of the resulting bit array e that originates from one of
the matrices X, Y or Z. In the latter, it denotes a subarray of the
input byte array E that is to be soft combined with one of the
matrices U, V or W.
SatFunc A function that calculates the sum of two integer bytes. If the
sum is lower than -128 or greater than 127 then the result is
adjusted to the closest boundary.
Section A number of consecutive rows of the matrix in Channel
deinterleaver that one job DSP has been designated to process.
Segment A number of consecutive rows in a section that a job DSP can fit
in its LDM. The DSP processes its section one segment at a time.
The word also denotes the part of its LDM that a DSP has
allocated to store segments of its section.
soft combining The act of calculating the sum of two integer bytes. If the result
is smaller than -128 or larger than 127 then it must be adjusted
to the closest boundary. This is done in Rate dematching by
using the function SatFunc.
soft value An integer in the range -128 to 127. It denotes the probable
value of one bit that a base station has received from a mobile
device. A high or low soft value indicates that the received bit's
value is 1 or 0, respectively. A soft value close to zero indicates
that the bit's value is ambiguous.
temporary register A type of registers in a DSP that can be used by most DSP
instructions to retrieve data from or store results to.
temporary register part One of two parts of a temporary register. This is the register’s
either lowest 16 bits or next higher 16 bits.
Third Generation
Partnership Project
An organization tasked with developing and maintaining
specifications for the 3G and GSM standards.
Uplink Transport
Channel
This is specified by the standard LTE. It is a set of processing
steps that must be applied to data in a specific order before it is
transmitted from a mobile device to a base station.
Word This denotes as always a sequence of bits in the context of some
computer hardware or one of its memory components. How long
that sequence is depends on the word size of the hardware or
memory. The word size of EMPA’s CM and its DSPs’ LDM and
LPM is 16 bits, and so the expression denotes a sequence of 16
bits.
worker id A non-negative integer value that is given to each job DSP that
participates in executing one dispatched JOB. The value is
unique among those job DSPs.
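The SatFunc and soft combining entries above amount to a saturating addition of two signed bytes. A minimal C sketch follows; the name sat_func and the types are an illustration, not code from the thesis.

```c
#include <stdint.h>

/* Sketch of SatFunc as described in Table 3: add two soft values and
 * saturate the sum to the soft-value range [-128, 127]. */
static int8_t sat_func(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;   /* widen so the sum can not overflow */
    if (sum > 127)
        sum = 127;
    if (sum < -128)
        sum = -128;
    return (int8_t)sum;
}
```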
2 EMPA
EMPA is the symmetrical multiprocessor architecture provided on ELTE's base
stations. It is used to execute implementations that need to run on the base station, such as
implementations that apply processing steps to received transmissions as described by
Paragraph 1.1.5. EMPA can execute parallelized implementations.
How an implementation is executed on EMPA will be explained in Paragraph 3.2,
while this chapter is concerned with presenting the components of EMPA that are
relevant to the thesis and explaining how they are connected. These components are the
Digital Signal Processors, the Common Memory and the Command Bus. However some
of them have properties that can not be publicly revealed due to the reasons stated in
Paragraph 1.3.2. Figure 3 gives a simple overview of EMPA and shows how the relevant
components are connected. The chapter’s source is [11].
The word size of all of EMPA's relevant memory components is 16 bits. The individual
memory components will be described in this chapter, but because of this, throughout this
report the term word refers to a sequence of 16 bits. Also, the phrase memory instruction
refers to a processor instruction that either reads or writes data to the memory in question.
There are therefore Local Data Memory instructions as well as Common Memory
instructions, which will be frequently referred to throughout this report. The two expressions
will be further explained in this chapter.
Figure 3: An overview of EMPA and its relevant units: multiple Digital Signal Processors connected to the
Common Memory over a data bus and to one another over the Command Bus. A dashed line means there are
multiple instances of the adjacent component.
2.1 The Digital Signal Processors
The Digital Signal Processors (DSP) are the microprocessors of EMPA that are used to
execute implementations. A parallelized implementation is executed on EMPA by
executing it on multiple DSPs in parallel.
The DSP must be well understood if one is to write efficient implementations.4 In fact
this component of EMPA was the most important to the thesis, and it will be extensively
referred to in this report. Therefore a large part of this chapter has been dedicated to
discussing the DSP's components and capabilities. The topics that need to be covered for the
purpose of the thesis are (i) the DSP's local memories, (ii) the registers used by the
DSP’s instructions, (iii) the DSP’s capabilities for executing instructions and how some
specific instructions work and (iv) the DSP’s hardware support for executing loops more
quickly. These topics will be covered by the following subparagraphs. Figure 4 gives an
overview of the DSP and those components of it that are relevant to the thesis. The reader
can refer back to it as he reads on. Throughout the report the phrase DSP instruction
refers to any instruction from the DSP’s instruction set.
Figure 4: An overview of the DSP and its relevant units: the local program and data memories, the memory
control unit, the registers (temporary, address and offset registers), the program control unit and the
computational units.
2.1.1 The DSP’s LDM and LPM
Each DSP has a Local Data Memory (LDM) and a Local Program Memory (LPM).
LPM, as any program memory, is used to store instructions that the DSP is to execute,
while LDM is used to store data that is to be read from or written to. The word size of
each memory is 16 bits. The memories follow the big endian principle. A portion of both
memories is unavailable for memory allocations. For the thesis it is relevant to know
that at least # LDM_avail KB of LDM are available for use.
4 Sometimes the phrase "the DSP" is used to refer to any DSP of EMPA, but it can of course also be used
to refer to a specific DSP. Which is the case will be obvious from the context in which the phrase is used.
2.1.2 The DSP’s registers
The DSP has a large set of registers. Those of them that are relevant to the thesis are
described in the following subparagraphs.
2.1.2.1 Temporary registers
The temporary registers are used by almost every DSP instruction to read values from
and to store results in. Throughout this report they will be denoted by a0,a1,a2,…. Each
register has two temporary register parts. To clarify, a0 will be taken as an example. The
16 least significant bits of it are in the lower register part and it is denoted by a0l. The
second register part contains the next 16 bits of a0 and it is denoted by a0h. So the two
register parts span over a0’s 32 least significant bits. Many DSP instructions use these
parts as if they are individual registers in their own right. For instance, the assembler
instruction mv a0l,a1l copies a0l’s value to a1l.
2.1.2.2 Address registers
Throughout this report an LDM instruction is a DSP instruction that writes data from a
temporary register to LDM, or it reads data in the opposite direction. Also the DSP’s
address registers will be denoted by r0,r1,r2,…. An address register is used to store the
address to a byte in LDM. An LDM instruction then uses the register to read or write to
its address. This will be further explained by Paragraph 2.1.3.3 that covers LDM
instructions.
2.1.2.3 Offset registers
Throughout this report the DSP’s offset registers will be denoted by m0,m1,m2,…. An
offset register is used by certain DSP instructions to modify an address register’s value.
Each offset register can be used to modify only certain address registers. When the report
shows examples of a DSP instruction that uses an offset register for this purpose, it will
be shown which register is being used and which address register is being modified.
2.1.3 Instruction execution on the DSP
What needs to be highlighted regarding instruction execution on the DSP is (i) its
instruction pipeline, (ii) how the DSP can execute multiple instructions in parallel and (iii)
what capabilities LDM instructions have.
2.1.3.1 The DSP’s instruction pipeline
The DSP has a very short instruction pipeline consisting of only three steps, namely
fetch, decode and execute. Most DSP instructions complete each step in one cycle.
Because of this, when the DSP is in the process of executing a series of such instructions,
one of them finishes the pipeline each cycle. Therefore the effective execution time of
each of these instructions is one cycle.
2.1.3.2 Executing parallel instructions on the DSP
The DSP can execute up to # nrINSTRPAR _ instructions in parallel. This is because it
has multiple units that execute instructions. The instructions pass through each step of the
instruction pipeline simultaneously. What instructions are to be performed in parallel
must be specified for the DSP. Certain limitations apply when instructions are to be
executed in parallel. Those that are relevant to the thesis are:
Up to two LDM instructions can be executed in parallel.
Up to 1__ nrLDMWITHPAR instructions can be executed in parallel with
two LDM instructions.
Some instructions can not be executed in parallel with others.
To specify the execution time required to perform parallel instructions, suppose there is
some set of instructions, and that the one of them with the longest individual execution time
requires #c cycles. This means that if that instruction is not executed in parallel with other
instructions, then it requires #c cycles. If the mentioned set of instructions is
executed in parallel, #c cycles are required to finish the whole task. This means that when the
DSP is executing some instructions in parallel, all of them must finish before the DSP can
move on to the next set of parallel instructions.
Parallel instructions may not write to the same register, but one of them may write to a
register that the others read from. The result of this is that the register is first read from by
the other instructions and then written to. This does not prolong the execution time of the
parallel instructions.
To clarify the statements, an example with assembler instructions is in order. The syntax
that is used for parallel instructions will also be presented.
mv 0,a1h | copy a1,a2
is two instructions performed in parallel. The former clears a1h while the latter copies a1
to a2. The result equals that of first executing the latter instruction and then the former.
The execution time is the same as for the parallel instructions
mv 0,a1h | copy a2,a3
Note how in this example the two instructions do not use the same register. Also pay
attention to the syntax used for parallel instructions as this will be used in the report.
Specifically,
“task A | task B | task C”
means three parallel instructions perform the tasks A, B and C.
There is no way to specify that some instructions are to be performed in parallel when
writing C code. This is completely up to the compiler EMC when it produces the
resulting assembler code from the C code.
Throughout this report it is important to be mindful of the difference between several
DSPs that execute one parallelized implementation on EMPA, and one DSP that executes
several instructions in parallel.
2.1.3.3 LDM instructions and their capabilities
An LDM instruction can (read or) write 8, 16 or 32 bits (from or) to an address of LDM
specified by an address register. The instruction writes data from a temporary register to
the memory (or it reads in the opposite direction). If the instruction reads or writes 8 or
16 bits then it uses a temporary register part. On the other hand if the instruction reads or
writes 32 bits, then it uses a complete temporary register. In any case the data is read
from or written to a number of bytes in LDM, where the address register being used
points out the first byte.
LDM instructions will be referred to extensively throughout the report, and so some
assembler instructions will be shown as examples. st a0,*r0 writes 32 bits from a0 to
four bytes in LDM, where r0 addresses the first. ld *r0,a1 reads four bytes from LDM
where r0 addresses the first. On the other hand st a0l,*r0 writes only two bytes, while
ld *r0,a1h reads two bytes. This becomes obvious by the fact that the latter two
instructions use temporary register parts instead of complete temporary registers.
However LDM instructions that read or write 8 bits use temporary register parts too. It
must then be further specified that only 8 bits are being handled. Therefore stb a0l,*r0 and
ldb *r0,a0h write and read one byte, respectively.
An LDM instruction takes up to # LDM_TIME_cycles cycles to execute. Its execution
time is irrespective of the amount of data that it handles. However, when two LDM
instructions are executed in parallel there may be a conflict in the hardware resources used,
which is imposed by the architecture of the DSP. In such a case one of the two instructions
will have its execution time prolonged by # LDM_DELAY_cycles cycles. But as stated in
Paragraph 2.1.3.2 the DSP must finish executing all instructions that it is currently
executing in parallel before it begins with the next set of parallel instructions. An
example is in order to explain what the consequence of this may be. The following shows
two LDM instructions executed in parallel with some other instruction.
ld *r0, a1 | ld *r1, a2 | SOME_INSTR
If there is a conflict in resource usage between the two LDM instructions, then one of them
will have its execution time prolonged by # LDM_DELAY_cycles cycles. If its resulting
execution time c is greater than SOME_INSTR's execution time, then the DSP requires #c
cycles to execute the set of parallel instructions. However if SOME_INSTR requires at least
#c cycles to execute, then the conflict in resource usage has no effect on execution time.
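The timing rule above can be sketched as a small C function. The parameter names stand in for the concealed LDM timing and delay constants and for SOME_INSTR's execution time; the function itself is only an illustration, not code from the thesis.

```c
/* Illustration only: effective cycle count of a set of parallel
 * instructions containing two LDM instructions and one other
 * instruction. ldm_cycles and ldm_delay stand for the concealed LDM
 * timing constants; conflict is nonzero if the two LDM instructions
 * compete for the same hardware resource. */
static int parallel_set_cycles(int ldm_cycles, int ldm_delay,
                               int other_cycles, int conflict)
{
    int delayed = ldm_cycles + (conflict ? ldm_delay : 0);
    /* the whole set finishes only when its slowest member finishes */
    return delayed > other_cycles ? delayed : other_cycles;
}
```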
What makes an LDM instruction powerful is that it can also add or subtract a certain
offset to the address register. This is done after the instruction has read or written data to
the memory, but it does not prolong the execution time of the instruction. It is important
to understand that handling the data and modifying the address register are performed by
one LDM instruction, not by two parallel instructions. The offset must be stored in the
address register's offset register, and the latter is then used to modify the address register
(Paragraph 2.1.2.3 explains offset registers).
To clarify with some examples, st a0,*r0++m0 writes 32 bits to the address specified by
r0, and then increments the address register by the value of m0. ld *r0--m0,a0l reads 16
bits from the memory, and then decrements the address register. Offset registers can in
the same manner be used in conjunction with the instructions ldb and stb that handle
only 8 bits.
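At the C level, the access pattern that these post-modifying instructions support looks as follows; the pointer plays the role of an address register such as r0 and the stride parameter that of an offset register such as m0. The function is a hypothetical sketch, not code from the thesis.

```c
/* Hypothetical sketch: store n 16-bit values with a fixed stride.
 * Each store advances the destination pointer as a side effect, the
 * pattern that a single "st a0l,*r0++m0" instruction performs. */
static void strided_store(short *dst, const short *src, int n, int stride)
{
    for (int i = 0; i < n; i++) {
        *dst = src[i];   /* write 16 bits to the current address */
        dst += stride;   /* post-modify the address by the offset */
    }
}
```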
2.1.4 Hardware support for executing loops on the DSP
This report will extensively discuss efficient execution of loops. For this purpose, a loop's
overhead refers to the execution of every instruction in the loop that (i) modifies any
counter that keeps track of how many iterations of the loop have been executed, (ii)
compares any such counter with another value to determine if the loop should stop
iterating, or (iii) is a conditional branch instruction that either continues execution in
the next iteration or finishes the loop.
The DSP has hardware support to execute a loop without any overhead, but it is
required that the number of times that the loop iterates is known before it is initiated. The
hardware then keeps track of the number of times the loop has iterated. When the last set
of parallel instructions has been executed in the end of one iteration, the DSP continues
executing the first set of parallel instructions in the beginning of the next iteration. This is
done without any delay.
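To illustrate the definition of overhead, the following C loop marks which of its operations count as overhead in the above sense; on the DSP, hardware loop support with a known iteration count removes exactly these, leaving only the body. The function is an illustration, not from the thesis.

```c
/* Sum n 16-bit values. The commented operations are the loop's
 * overhead as defined above; the addition in the body is the only
 * useful work. */
static int sum_loop(const short *a, int n)
{
    int s = 0;
    int i = 0;              /* counter that tracks iterations       */
    while (i < n) {         /* (ii) compare, (iii) conditional jump */
        s += a[i];          /* useful work                          */
        i++;                /* (i) counter update                   */
    }
    return s;
}
```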
2.2 The Common Memory
The Common Memory (CM) is a shared memory that any DSP can access. Its size is
# CM_size MB, though a portion of it is not available for memory allocations. Its word size
is 16 bits, and it follows the big endian principle.
Throughout this report, a CM instruction is an instruction that the DSP executes to write
data from its LDM to CM (or it reads data in the opposite direction). Be mindful of the
difference that an LDM instruction transfers data between a temporary register and LDM,
while a CM instruction transfers data between LDM and CM.
A CM instruction writes an array of sequential words from LDM to CM (or it reads an
array of sequential words in the opposite direction). Depending on whether data is to be read
from or written to CM, the instruction takes at least # CM_read_lat or # CM_write_lat cycles to
initiate, respectively. It may take longer to initiate due to conflicts in the hardware resources
used when several DSPs access CM simultaneously. This is due to the architecture of
EMPA. Once a CM instruction has been initiated it can transfer up to # CM_speed words per
cycle. Again, at some cycles the number of words may be smaller due to conflicts in
resource usage. However, the speed is irrespective of whether data is being read from or
written to CM.
A byte can not be addressed in CM. If a CM instruction is to read from or write to a
byte in CM, then this must be done to the word that the byte is located in. Because of this
a CM instruction can only read from or write to an array of complete words in CM, and
not an arbitrary array of bytes. If the instruction is to handle an array of bytes in CM that
begins or ends at an odd byte address, then the array must first be expanded so that it
encompasses complete words. Thus the CM instruction must read from or write to up to
two bytes at the ends of the array that actually are not of interest.
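The expansion to complete words can be sketched as follows. This is a hypothetical helper (not from the thesis) that, given a byte range in CM, computes the smallest enclosing range of 16-bit words together with the unwanted padding bytes at either end:

```python
WORD_BYTES = 2  # CM word size is 16 bits

def enclosing_word_range(start_byte, num_bytes):
    """Expand a byte range to the smallest enclosing range of complete words.

    Returns (first_word, num_words, lead_pad, tail_pad), where lead_pad and
    tail_pad are the bytes at the ends that are transferred but not of interest.
    """
    assert num_bytes > 0
    first_word = start_byte // WORD_BYTES
    last_word = (start_byte + num_bytes - 1) // WORD_BYTES
    num_words = last_word - first_word + 1
    lead_pad = start_byte - first_word * WORD_BYTES
    tail_pad = (last_word + 1) * WORD_BYTES - (start_byte + num_bytes)
    return first_word, num_words, lead_pad, tail_pad
```

For example, a five-byte array starting at byte 3 must be transferred as three words, with one extra byte at the front. Note that lead_pad + tail_pad never exceeds 2, matching the "up to two bytes" remark above.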
2.3 The Command Bus
The Command Bus (CB) allows a pair of processes running on two DSPs to
communicate with one another via short messages. A message can carry a payload of
#MSG_PAYLOAD bits. Additionally, the message specifies the id of the receiving
process and a signal number. A process that sends a message specifies these two values.
A receiving process declares interest in receiving a message with a certain signal number.
The execution of the receiving process is halted until it receives a message with that
signal number.
The procedure, from when a process sends a message until the receiving process may
resume its work, takes on average #MSG_TIME cycles to complete.
3 Execution methodologies used on EMPA
There are rules and execution methodologies that an implementation that is to run on
EMPA must comply with. Those of them that are relevant to the thesis are presented in this
chapter. What is relevant is (i) how DSPs can synchronize their work by using locks, (ii)
how the execution of a parallelized implementation is dispatched to be performed by
multiple DSPs, and (iii) how the DSPs should perform barrier synchronization.
3.1 Locks
There are locks that can be used for synchronizing the work of the DSPs. A lock must
of course be accessible by all DSPs and so it uses a variable in CM. Because of this the
memory latencies of #CM_read_lat and #CM_write_lat cycles apply when reading and writing
to the lock, respectively (see Paragraph 2.2).
These locks may differ from how the reader would expect a lock to function. First of all,
when locking or unlocking a lock, what matters is which DSP performs the operation, not
which process running on the DSP requested it. This means that
when a lock is locked it is the DSP that becomes its holder, and the process that
requested the operation may very well completely finish its execution on the DSP. The
DSP will still be the holder of the lock. Its unlocking may very well be requested by a
different process that executes on the DSP at a later time. However, confusion still does
not arise because a process must allocate a lock before it can be used. Therefore that
process is the only one that “has knowledge” of the lock, unless it shares this information
with other processes in some manner.
There is another important way in which these locks differ from the ordinary notion of
a lock. Of course, if a DSP has locked a lock, then other DSPs that request it to be locked
will be put on hold. They must wait until it is unlocked, at which point one of the waiting
DSPs may proceed and becomes the holder of the lock. This is as expected. However, a
lock may be unlocked by any DSP, and not only by the one that is its holder.
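In this respect the locks behave like binary semaphores rather than mutexes. A loose analogy can be sketched in Python (an illustration only, not EMPA's API), using threading.Semaphore(1), which like these locks can be released by a thread other than the one that acquired it:

```python
import threading

# A binary semaphore mirrors the lock semantics described above:
# "locking" is acquire, "unlocking" is release, and the release may be
# performed by a different thread ("DSP") than the one that acquired it.
lock = threading.Semaphore(1)

lock.acquire()            # "DSP A" locks the lock

def other_dsp_unlocks():
    lock.release()        # "DSP B" unlocks it, even though it is not the holder

t = threading.Thread(target=other_dsp_unlocks)
t.start()
t.join()

# The lock is free again, so a non-blocking acquire succeeds.
acquired_again = lock.acquire(blocking=False)
```

An ownership-checked mutex would reject the release from "DSP B"; the semaphore, like the EMPA locks, allows it.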
3.2 To dispatch a JOB
When an implementation is requested to run on EMPA, a master function is invoked on
a DSP. The implementation’s input is given to the function. Most commonly the
input is very large, in which case it is placed at a location in CM and a pointer to that
location is provided to the function. The objective of the function is to dispatch the
implementation on a chosen number of other DSPs that will execute it. If more than one
DSP is used for this purpose then it is a parallelized implementation.
This procedure is called to dispatch a JOB.5 The DSP that dispatches it is referred to as
a dispatch DSP, while the DSPs that the JOB is dispatched to are called job DSPs. Each
DSP of EMPA has been dedicated to one of the two purposes.
To clarify, say that a processing step is to be applied to some data. The data is placed in
CM and an implementation of the processing step begins to execute when the master
function is invoked on a dispatch DSP. If it is a parallelized implementation, then the
function dispatches a JOB to multiple job DSPs. Those DSPs will cooperate to perform
the processing step over the data.
A JOB defines all the functions and variables that are to be used by a job DSP that is to
perform it. These have been placed in CM even before the master function was invoked.
They will be uploaded to the job DSP’s LPM and LDM when it begins to execute it.
However, the definition of a JOB includes neither the number of job DSPs that are to
execute it, nor the input that is provided for the master function. These may differ from
one dispatch of the JOB to the next. Also, a JOB is dispatched with one of three
possible priorities. What meaning the priorities have will be explained shortly, but the
reader must understand that the same JOB can be dispatched with different priorities.
Every JOB has an entry function, which is the first function to be invoked on the job
DSPs. The function can have two 32-bit variables as its input. If this is not enough for the
part of the input that the job DSP is to process, then one of these variables may be a
pointer to a location in CM where the input is stored. The entry function can not
explicitly return a value the way an ordinary function does. Its output is to either be stored in CM
at a location specified by the entry function’s input or sent back to the dispatch DSP via
CB.
Figure 5 illustrates what happens when a dispatch DSP dispatches a JOB, while Figure
6 shows how a job DSP cycles through executing one JOB after another. The reader
should review them as he reads on. When the master function performs the dispatch it
specifies (i) the input of the entry function, (ii) the JOB’s priority and (iii) the number of
job DSPs that are to execute the JOB. Upon dispatch, those job DSPs that announce
themselves as available receive a message over CB that instructs them to execute the
JOB. If there are not enough available DSPs, then the JOB is also placed in a queue in
CM. It is important to note that those DSPs that were available begin executing the JOB,
even if they are not sufficiently many. It is specified in the queue how many more DSPs
are required. Of course, if there are no available DSPs, then the JOB is only placed in the
queue.
There are three queues, one for each priority that a JOB may have. Each queue follows
the first in first out principle. When any job DSP completes executing its current JOB, it
reads the queues in order of priority. It picks a JOB from the first non-empty queue. If
there are no JOBs at all, then the DSP announces itself as available and awaits a message
5 Capital letters are used for this name so that the ordinary word “job” can be used freely without
confusion.
over CB as previously described. Note that completing a JOB does not make a DSP
announce itself as available, since if it finds a JOB in a queue then it begins executing it.
It is only if the queues are empty that it becomes available.
[Figure 5 flowchart: A dispatch DSP dispatches a JOB that is to execute on #N job
DSPs. If #N or more job DSPs announce themselves as available, #N of them are
instructed via CB to begin executing the JOB. If only 0 < X < N job DSPs announce
themselves as available, those #X job DSPs are instructed via CB to begin executing the
JOB, and the JOB is placed on a queue with a request that N−X more job DSPs
participate in executing it. If no job DSPs are available, the JOB is placed on a queue
with a request that #N job DSPs execute it.]
Figure 5: The figure shows how a JOB is dispatched by a dispatch DSP.
[Figure 6 flowchart: When a job DSP finishes executing its current JOB, it checks
whether there are any JOBs in the queue. If so, the JOB with the highest priority is
chosen; the job DSP uploads the JOB’s functions and data to its local memories and
begins executing it, all the while announcing itself as unavailable. If the queue is empty,
the job DSP announces itself as available and awaits a message via CB; once a dispatch
DSP instructs it via CB to execute a JOB, it uploads that JOB’s functions and data and
begins executing it.]
Figure 6: The figure shows how a job DSP cycles through executing JOBs, and it shows when the DSP is
announcing itself as available.
If the input to the entry function is to differ among the job DSPs, then multiple
separate dispatches of the same JOB are required. Compare this to dispatching it once,
which requires only one message over CB and is performed much quicker. But even if
the JOB is dispatched only once, each job DSP that participates in it is provided with a
worker id. This is a non-negative integer value that is unique among the job DSPs that are
participating. Each of them can use it to calculate what part of the JOB it has been
designated to perform.
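A common way to derive a work partition from the worker id is a balanced contiguous split. The following is a hypothetical helper (not from the thesis) sketching how a job DSP could map its worker id to a slice of an N-element input:

```python
def worker_slice(worker_id, num_workers, n):
    """Return the half-open range [start, end) of elements that the job DSP
    with the given worker id should process. Every element is covered exactly
    once, and slice sizes differ by at most one."""
    start = (n * worker_id) // num_workers
    end = (n * (worker_id + 1)) // num_workers
    return start, end
```

For three job DSPs over ten elements, the slices are [0, 3), [3, 6) and [6, 10).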
Once the master function has dispatched its JOB, it must await a message from the job
DSPs confirming that the JOB has been completed. Previously, each job DSP sent a
message over CB once it had completed executing. The master function would keep track
of the number of received messages and realize when all of the job DSPs were done. A
problem with this method is that it generates a lot of undue stress on the dispatch DSP.
The solution that every implementation is now required to use is that only the last
job DSP that completes executing the JOB sends a message. A lock is used to keep track
of this.
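The completion protocol can be sketched with Python threads standing in for job DSPs (an illustrative model, not EMPA code): a counter protected by a lock tracks how many job DSPs are still running, and only the last one to finish "sends the message" to the dispatch DSP.

```python
import threading

NUM_JOB_DSPS = 4
remaining = NUM_JOB_DSPS
counter_lock = threading.Lock()
messages_sent = []  # stands in for messages over CB

def job_dsp(worker_id):
    global remaining
    # ... the actual work of the JOB would happen here ...
    with counter_lock:
        remaining -= 1
        if remaining == 0:
            # Only the last job DSP to finish notifies the dispatch DSP.
            messages_sent.append("JOB done")

threads = [threading.Thread(target=job_dsp, args=(w,)) for w in range(NUM_JOB_DSPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However many job DSPs participate, exactly one message is sent, which is what relieves the dispatch DSP.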
It is strongly requested that the dispatch DSP (where the master function runs) takes
part in the actual work of the JOB as little as possible. Execution time on dispatch DSPs
is solely to be used to perform dispatches and for overall maintenance work required by
EMPA. In fact, the thread running the master function on the dispatch DSP will most
likely be switched out once the JOB has been dispatched and the master function begins
to wait for a message.
3.3 Memory allocation
Both job and dispatch DSPs can allocate memory in CM. But only a dispatch DSP can
dynamically allocate memory in its LDM. Functions that are to run on job DSPs must
specify all the memory that they require at compile time. For simplicity, when the
report describes some work that a job DSP performs it may still be stated that it “allocates
memory in its LDM”. The reader should understand that this means
that the memory has been reserved at compile time.
3.4 Barrier synchronization between the job DSPs
It is possible for a set of job DSPs that require barrier synchronization to perform it by
using locks. During the thesis an algorithm was written for this purpose that uses three
locks to enforce mutual exclusion and to perform the actual barrier synchronization.
However, it was of no use, because it is not acceptable that a set of job DSPs use explicit
barriers to synchronize their work. Recall from Paragraph 3.2 how JOBs are
dispatched to the job DSPs. Say that a JOB is requested to execute on three DSPs, but
only two are available at the time of the dispatch. They begin to execute the JOB, and the
last DSP may begin much later. If the first two synchronize at a barrier, then they must
wait until the last one initiates the JOB and arrives at the same barrier.
An unacceptably large portion of execution time on all job DSPs is wasted if barriers are
widely used. In fact, it is theoretically possible that all job DSPs of EMPA fall into a
deadlock because each of them is waiting at a barrier in a JOB that has not been fully
dispatched to the required number of job DSPs.
If an algorithm requires barrier synchronization at a point in its calculations, then it is
not implemented as one JOB using an explicit barrier. Instead it is implemented as two
JOBs. The first JOB ends where the barrier synchronization takes place in the algorithm,
and the next JOB continues from that location. When the job DSPs have finished the first
JOB and the master function is notified, it proceeds by dispatching the second JOB. Each
DSP can finish executing the first JOB at any time irrespective of the other DSPs. The
disadvantage of this solution is that another costly dispatch must be performed, thus
prolonging the execution time of the algorithm’s implementation.
4 Definition of an optimal loop
Efficient execution of loops will be extensively discussed in this thesis. Therefore some
crucial definitions are in order. First, a critical loop is a loop that (i) does not contain any
loops and (ii) accounts for a large part of the implementation’s execution time.
The latter criterion is a bit subjective, but nevertheless the definition suffices for this
report. Obviously, optimizing critical loops can be a great way to reduce the
implementation’s overall execution time.
The next definition is a bit bold, namely whether a loop that a DSP is to execute is an
optimal loop or not. To answer this, one must first know (i) what the task of the loop is
and (ii) how the capabilities of the DSP limit that task. If the limit is met and the
capabilities of the DSP are thus optimally used by the loop, then it is said to be optimal.
This definition may seem subjective, but what is required is a precise definition of the
loop’s task and a complete understanding of the limitations of the DSP that apply to the
task.6
An example is in order. Say that a loop is tasked with adding 8 bit integers from two
arrays and storing the results in a third array. Specifically, there are three byte arrays X, Y
and Z in LDM with n bytes each. Z[i] = X[i] + Y[i] is to be calculated for every
0 ≤ i ≤ n − 1, and it is so that −128 ≤ X[i] + Y[i] ≤ 127.
The task of the loop has been defined. Now one must obtain a complete understanding
of the DSP’s capabilities that apply to the loop in order to make it optimal. First and
foremost, the loop’s overhead must be eliminated by using the DSP’s hardware support
for executing loops (see Paragraph 2.1.4 for a description of the hardware support).
Second, each LDM instruction must read or write as many bits as possible (namely 32
bits) to LDM and modify its address register if necessary (see Paragraph 2.1.3.3 for the
capabilities of LDM instructions). Separate instructions may not be used to update the
address registers. Last but not least, a careful review is required of which DSP instructions
can perform additions. Carefully reviewing [11] shows that the instruction add4 is very
suitable. add4 a0,a1,a2 treats the four lower bytes of each of the three temporary
registers as individual integers. The lowest byte of a0 is added to the lowest byte of a1 and the
result is stored in the lowest byte of a2, and so on. The instruction performs four additions in
this manner. Two add4 instructions can be performed in parallel, but only one add4
instruction can be performed in parallel with two LDM instructions. It is also guaranteed
that an add4 instruction will never require a longer execution time than an LDM
instruction.
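The lane-wise behaviour of add4 can be modelled as follows. This is an illustrative Python model only (the real instruction operates on the DSP's 32-bit temporary registers); it packs four signed bytes into a word and adds the lanes independently:

```python
def pack4(bytes4):
    """Pack four signed bytes (lowest lane first) into one 32-bit word."""
    word = 0
    for lane, b in enumerate(bytes4):
        word |= (b & 0xFF) << (8 * lane)
    return word

def add4(word_a, word_b):
    """Model of add4: add the four byte lanes of two words independently."""
    result = 0
    for lane in range(4):
        a = (word_a >> (8 * lane)) & 0xFF
        b = (word_b >> (8 * lane)) & 0xFF
        result |= ((a + b) & 0xFF) << (8 * lane)
    return result

def unpack4(word):
    """Unpack a 32-bit word into four signed bytes (lowest lane first)."""
    out = []
    for lane in range(4):
        b = (word >> (8 * lane)) & 0xFF
        out.append(b - 256 if b >= 128 else b)
    return out
```

Since the task guarantees −128 ≤ X[i] + Y[i] ≤ 127, no lane ever overflows its byte.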
In light of this, one might be tempted to think that the loop shown in Algorithm 1
performs the task optimally. It calculates eight integers of Z per iteration, so it must be
followed by another loop that calculates any remaining integers at the end of Z. But there
can not be more than seven remaining, and so the second loop has been omitted.
6 Throughout this paragraph, one should consider that a capability can also be regarded as a limitation.
Algorithm 1: A non-optimal loop for the task.
//The loop calculates eight bytes of Z in each iteration.
//X, Y and Z are in LDM.
//The variable i is only used for clarification,
//and no instruction is used to update it.
//The LDM instructions update their address registers accordingly.
//The loop has no overhead since the DSP’s hardware support is used.
//i = 0;
loop {
//Two parallel LDM instructions.
LDM instruction that reads X[i...i+3] to a0 |
LDM instruction that reads X[i+4...i+7] to a2;
//Two parallel LDM instructions.
LDM instruction that reads Y[i...i+3] to a1 |
LDM instruction that reads Y[i+4...i+7] to a3;
//Two parallel add4 instructions.
add4 a0, a1, a4 | add4 a2, a3, a5;
//Two parallel LDM instructions.
LDM instruction that writes a4 to Z[i...i+3] |
LDM instruction that writes a5 to Z[i+4...i+7];
//i += 8;
}
This loop is not optimal. It is true that every LDM instruction transfers the maximal
possible amount of data, which is 32 bits. Also, LDM instructions are performed in
parallel pair-wise, which is the maximal number of LDM instructions that can be
executed in parallel. Paragraph 2.1.3.3 explains how there may be a conflict in hardware
resource usage when parallel LDM instructions are executed. This would prolong the
execution of the LDM instructions by #LDM_DELAY cycles. However, it is
guaranteed that this does not happen in the loop.
Still, the loop fails to be optimal, for the simple reason that it does not execute the add4
instructions in parallel with the LDM instructions. This remark can be stated in a more
convenient manner. No matter how the loop’s task is done, in the end a certain amount of
data must be read from LDM (specifically from X and Y), and a certain amount of data
must be written to LDM (specifically to Z). This is not being done at the maximal
possible speed, because while the add4 instructions are being executed no data is
being read from or written to LDM. Another attempt is made:
Algorithm 2: An optimal loop for the task.
read X[0...3] to a0 | read Y[0...3] to a1;
read X[4...7] to a2 | read Y[4...7] to a3;
//i = 0;
loop {
read X[i...i+3] to a0 | read Y[i...i+3] to a1 | add4 a0, a1, a4;
read X[i+4...i+7] to a2 | read Y[i+4...i+7] to a3 | add4 a2, a3, a5;
write a4 to Z[i...i+3] | write a5 to Z[i+4...i+7];
//i += 8;
}
//The loop stopped at i = j.
add4 a0, a1, a4 | add4 a2, a3, a5;
write a4 to Z[j...j+3] | write a5 to Z[j+4...j+7];
The construction requires that some of the work is performed before and after the loop,
but which instructions are performed outside of the loop does not determine whether it is
optimal or not. For this loop it is easy to make sure that there is no conflict in resource usage
when parallel LDM instructions are executed. Also recall that it is guaranteed that the
execution time of the add4 instruction is not longer than an LDM instruction’s. So no
LDM instruction will have its execution time prolonged due to a resource conflict, nor
because it must wait until a parallel add4 instruction is done.
This makes the new loop optimal. To see this, understand that no matter how the loop’s
task is performed, in the end some data must be read from and written to LDM, and the
loop spends its entire execution time transferring data between the temporary registers and
LDM at the maximal possible speed.
Loops that are tasked with (i) reading data from LDM, (ii) processing it in some way
and (iii) writing the result to LDM occur frequently in the implementations that are
written by ELTE. The critical loops of the implementations written in the thesis are also
of this type. A simple yet instructive criterion can be stated that guarantees that such a loop is
optimal. The loop must uphold the following:
- By each cycle two parallel LDM instructions must be executing.
- There may be no conflicts in hardware resource usage when executing parallel
LDM instructions.
- Each LDM instruction must read or write 32 bits to LDM.
- Other instructions that are used may only be executed in parallel with LDM
instructions.
- Other instructions may not have a longer execution time than an LDM instruction.
Throughout this report the criterion will be referred to as the criterion for optimal loop
performance. If it is met then it is guaranteed that the loop is transferring data between its
temporary registers and LDM at the maximal possible speed. It does not matter what
the loop is required to do with the data and how this is done. In the end the data must at
least be read and written and the loop spends its entire execution time to do this as fast as
possible.
Note that a loop can be optimal even if it does not meet this criterion. Its task may be
either completely different or perhaps limited by some other DSP limitation than the
amount of data that can be read or written to LDM per cycle.
5 Specification of the processing steps
The purpose of this chapter is to specify the processing steps Channel deinterleaver and
Rate dematching that are to be implemented on EMPA in the thesis. But understanding
them becomes easier by first understanding how they are related to two processing steps
that are performed on the mobile device. The sources of the chapter are [4], [1] and [9].
5.1 Overview of the Uplink Transport Channel
The standard LTE specifies an Uplink Transport Channel (ULTC) that outlines how
data must be processed before being transmitted from the mobile device to the base
station. A set of processing steps must be applied to the data in a specific order. See the
upper half of Figure 2 for a description. The data is the input of the first processing step
and its output is the input of the next step, which in turn produces an output that the next
one will receive, and so on. The output of the last step is transmitted to the base station.
To explain each of these processing steps and what their purposes are is beyond the scope
of the thesis. Only those of them that are relevant to the thesis will be described in detail,
namely Rate matching and Channel interleaver.7
The base station must obtain the original data that was the first input of ULTC’s
processing steps. This is achieved on the base station by applying another set of
processing steps in a specific order to the received transmission. Figure 2 clarifies this.
Some of the base station’s steps are directly related to some of ULTC’s. Two examples
are Rate dematching and Channel deinterleaver that are to run on the base station and
were implemented in the thesis. They are directly related to the two steps of ULTC that
were mentioned above. Understanding the two processing steps that were implemented in
the thesis becomes easier by understanding their two counterparts in ULTC. That is why
all four processing steps will be specified in this chapter.
There is another subtle difference between ULTC’s and the base station’s processing
steps. When the base station has received the transmission it can not be certain whether the bits
that the mobile device transmitted have changed due to radio interference. For this
purpose, a method is used in mobile communication technology for estimating the
probability that a received bit has a certain value. Let P0(b) and P1(b) denote the
probabilities that a received bit b is zero or one, respectively. The following value
rounded to the closest integer is called a soft value:
7 Note that these are not the processing steps that were implemented in the thesis. Read further to
understand.
127, if log(P1(b)/P0(b)) ≥ 127
−128, if log(P1(b)/P0(b)) ≤ −128
log(P1(b)/P0(b)), otherwise
Formula 1: The value rounded to an integer is a soft value.
A soft value close to 127 or −128 says that the intended value of the bit is likely one or
zero, respectively. But if the soft value is close to zero then the bit’s value is ambiguous.
When the base station receives a transmission, it determines a soft value for each received
bit. Then it applies its processing steps to the soft values. Thus ULTC’s steps work with
bits, while the base station’s steps work with bytes.
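Formula 1 can be sketched as follows. This assumes the reconstructed reading above of the garbled original (a log-likelihood ratio, rounded and saturated to the signed byte range; the base of the logarithm is taken to be natural):

```python
import math

def soft_value(p1, p0):
    """Soft value of a received bit b, given P1(b) = p1 and P0(b) = p0."""
    llr = math.log(p1 / p0)
    if llr >= 127:
        return 127
    if llr <= -128:
        return -128
    return round(llr)
```

A bit that is equally likely to be zero or one (p1 = p0) gets the ambiguous soft value 0, while overwhelming evidence saturates at 127 or −128.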
5.2 Channel deinterleaver
In this paragraph Channel interleaver, which is one of ULTC’s processing steps, is
specified first. Channel deinterleaver, which is one of the base station’s processing steps,
is specified next.
5.2.1 Specification of Channel interleaver
The processing step receives three arrays of so called symbols. A symbol is #M bits
long, where M is also specified by the input. The following is the input:
- An array of symbols g[0...G−1].
- An array of symbols r[0...I−1].
- An array of symbols q[0...Q−1].
- A positive integer M that equals 2, 4 or 6. It denotes a symbol’s length in bits.
- A positive integer C that equals 10 or 12. It denotes the number of columns in a
matrix that is to be used in the processing step.
The lengths G, I and Q of the arrays are also specified by the input. The following will
be the processing step’s output:
- An array of symbols s[0...G+I−1].
A matrix X is used in the processing step. Each cell of the matrix can fit a symbol. It has
#C columns and R = (G+I)/C rows. G+I is a multiple of C. R can not be greater than
1200.
The processing step will now be explained. Figure 7 shows how it works step by step
for a small example. The reader should consult the figure as he reads on.
The symbols of the array r are first placed into the matrix as Algorithm 3 shows.
Starting from the last row and moving up, the symbols are inserted into four columns
designated by the array columnSet.
Algorithm 3: The algorithm shows how the symbols of r are placed into the matrix X.
//The symbols of r[0...I-1] are to be placed into the matrix X.
//columnSet[0..3] is an array of positive integers.
//R is the number of rows in the matrix X.
columnSet = [1, 10, 7, 4];
i = 0;
j = 0;
currRow = R - 1;
while (i < I) {
currCol = columnSet[j];
X[currRow, currCol] = r[i];
i++;
j++;
j = j % 4;
if (j == 0) {
currRow--;
}
}
The symbols of g are then inserted into the free cells of X as Algorithm 4 shows. The
symbols are inserted row by row, but cells that are already occupied by one of r’s
symbols are skipped.
Algorithm 4: The algorithm shows how the symbols of g are placed into the matrix X.
//The symbols of g[0...G-1] are to be placed in the matrix X.
//R and C are X’s number of rows and columns respectively.
i = 0;
currRow = 0;
currCol = 0;
while (i < G) {
if (X[currRow, currCol] is not occupied by a symbol of r) {
X[currRow, currCol] = g[i];
i++;
}
currCol++;
if (currCol == C) {
currCol = 0;
currRow++;
}
}
Now all the matrix’s cells are occupied by symbols, because it has exactly G + I cells.
Next, some of g’s symbols in the matrix are replaced by q’s according to Algorithm 5.
Note the similarity between it and Algorithm 3. The only difference is that columnSet
designates different columns in the two algorithms.
Algorithm 5: The algorithm shows how the symbols of q are placed into the matrix X.
//The symbols of q[0...Q-1] are to be placed in the matrix X.
//columnSet[0..3] is an array of positive integers.
//R is the number of rows in the matrix X.
columnSet = [2, 9, 8, 3];
i = 0;
j = 0;
currRow = R - 1;
while (i < Q) {
currCol = columnSet[j];
X[currRow, currCol] = q[i];
i++;
j++;
j = j % 4;
if (j == 0) {
currRow--;
}
}
The processing step’s final output is an array of symbols s[0...G+I−1]. It is obtained
by reading X’s symbols column by column.
Figure 7: The processing step is demonstrated for the matrix X with C = 12 columns and R = 6 rows. The
arrays g, r and q have G = 59, I = 13 and Q = 10 symbols respectively. X is shown after each step from top
to bottom. Cells that are modified have a darker shade.
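The whole Channel interleaver step can be sketched in Python by composing Algorithms 3–5 (symbols are represented as plain list elements; the columnSet values are those given above, which assume C = 12):

```python
def channel_interleave(g, r, q, C):
    """Place r (Algorithm 3), fill with g (Algorithm 4), overwrite with q
    (Algorithm 5), then read the matrix X column by column."""
    R = (len(g) + len(r)) // C
    X = [[None] * C for _ in range(R)]
    from_r = set()
    # Algorithm 3: place r into columns [1, 10, 7, 4], bottom row upwards.
    column_set = [1, 10, 7, 4]
    curr_row = R - 1
    for i, sym in enumerate(r):
        col = column_set[i % 4]
        X[curr_row][col] = sym
        from_r.add((curr_row, col))
        if i % 4 == 3:
            curr_row -= 1
    # Algorithm 4: fill the free cells with g, row by row.
    it = iter(g)
    for row in range(R):
        for col in range(C):
            if (row, col) not in from_r:
                X[row][col] = next(it)
    # Algorithm 5: replace g's symbols with q in columns [2, 9, 8, 3].
    column_set = [2, 9, 8, 3]
    curr_row = R - 1
    for i, sym in enumerate(q):
        X[curr_row][column_set[i % 4]] = sym
        if i % 4 == 3:
            curr_row -= 1
    # Read X column by column.
    return [X[row][col] for col in range(C) for row in range(R)]
```

With Figure 7's parameters (C = 12, G = 59, I = 13, Q = 10) the output has G + I = 72 symbols; the columns used by r and q are disjoint, so q only ever overwrites symbols of g.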
5.2.2 Specification of Channel deinterleaver
The processing step receives one array of symbols, but now a symbol is #M bytes long.
M is also specified by the input. The following is the input:
- A positive integer Q (its purpose will be explained shortly).
- An array of symbols S[0...N−1], where N is also specified by the input.
- A positive integer M that equals 2, 4 or 6. It denotes a symbol’s size in bytes.
- A positive integer C that equals 10 or 12. It denotes the number of columns in a
matrix that is to be used in the processing step.
The following will be the processing step’s output:
- An array of symbols T[0...N−1].
A matrix Y is used in the processing step. Each cell of the matrix can fit a symbol. It has
#C columns and R = N/C rows. N is a multiple of C. R can not be greater than 1200.
How the processing step works will now be explained. Figure 8 shows how it works
step by step for a small example. The reader should consult the figure as he reads.
First, the symbols of S are placed into the matrix Y column by column. The symbols are
placed into the matrix in incremental order, starting with S[0], then S[1], and so on. Every
cell of Y is now occupied by a symbol of S.
Next, #Q of the symbols in Y are cleared, i.e. they are set to zero. Algorithm 6 shows
which symbols. Note the similarity between it and Algorithm 5, which shows where Channel
interleaver places the array q’s symbols into the matrix X.
Algorithm 6: The algorithm shows what symbols of Y are cleared.
//#Q of Y’s symbols are cleared.
columnSet = [2, 9, 8, 3];
i = 0;
j = 0;
currRow = R - 1;
while (i < Q) {
currCol = columnSet[j];
Y[currRow, currCol] = 0;
i++;
j++;
j = j % 4;
if (j == 0) {
currRow--;
}
}
The processing step’s final output is an array of symbols T[0...N−1]. It is obtained by
reading the symbols of Y row by row.
Figure 8: The processing step is demonstrated for Y with C = 12 columns and R = 6 rows. The array S has
N = 72 symbols. Q equals 10. Y is shown after each step from top to bottom. Cells that are modified have a
darker shade.
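The Channel deinterleaver specification above can be sketched in Python (symbols are represented as plain list elements rather than M-byte values):

```python
def channel_deinterleave(S, Q, C):
    """Fill Y column by column with S, clear Q symbols as in Algorithm 6,
    then read Y row by row."""
    N = len(S)
    R = N // C
    # Place the symbols of S into Y column by column, starting with S[0].
    Y = [[None] * C for _ in range(R)]
    for i, sym in enumerate(S):
        Y[i % R][i // R] = sym
    # Clear #Q symbols, following Algorithm 6.
    column_set = [2, 9, 8, 3]
    curr_row = R - 1
    for i in range(Q):
        Y[curr_row][column_set[i % 4]] = 0
        if i % 4 == 3:
            curr_row -= 1
    # Read the symbols of Y row by row.
    return [Y[row][col] for row in range(R) for col in range(C)]
```

For N = 24 and C = 12 (so R = 2) with Q = 0, the output at position row*C + col is simply the input at position col*R + row, i.e. the column-major fill read back row-major.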
5.3 Rate dematching
In this paragraph Rate matching, which is one of ULTC’s processing steps, is specified
first. Rate dematching, which is one of the base station’s processing steps, is specified
next. The specification of the latter relies greatly on the former. The reader is therefore
advised to read the paragraph about Rate matching carefully before he attempts to
understand Rate dematching.
5.3.1 Specification of Rate matching
The input of the processing step is the following:
- Three bit arrays, namely a[0...D−1], b[0...D−1] and c[0...D−1]. The first #N
bits of each array are NULL. The length D of the arrays is also specified by the
input and is constrained by D ≤ 6148.
- A positive integer S, which denotes the length of the processing step’s output bit
array.
- A non-negative integer T ≤ 3 (its purpose will be explained shortly).
The output of the processing step is:
- A bit array e[0...S−1].
Obviously a bit’s value can be set to either 0 or 1, not NULL. But the #N first bits of the
three arrays are considered to be NULL. Apart from this, they will be treated just like all
the other bits in this processing step. If it is stated in the description of this processing
step that a bit is set to NULL, then the reader should understand that from there on
the bit’s value is considered to be NULL.
For simplicity of explanation, this processing step can be divided into five consecutive
steps. The reader should not be confused by the usage of the word “step” and
consider them to be individual processing steps. The steps are, in order, Padding,
Permutation I, Permutation II, Bit collection and Bit selection. Figure 9 gives an
overview of the entire processing step. The reader should often refer back to it as he reads
on. Padding is performed on the arrays a, b and c individually. Then the output of each
step is passed on as input to a following step. Finally, the array e is the output of Bit
selection. The integer T is used in the final step.
In the following subparagraphs, the five steps will be explained individually. A small
example of the entire processing step will be followed throughout each step. A figure in
each subparagraph will explain what that step does to the example. The reader should
also consult these figures often.
Two positive integers will be used throughout the processing step. R is the smallest
integer such that D ≤ 32·R. K equals 32·R. Also, the function P will be used. It is
defined as follows:
Table 4: Definition of the function P.8
i:     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
P(i):  0 16  8 24  4 20 12 28  2 18 10 26  6 22 14 30
i:    16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
P(i):  1 17  9 25  5 21 13 29  3 19 11 27  7 23 15 31
8 Note that the function is its own inverse. Also note that P(2i+1) = P(2i) + 16,
P(i+8) = P(i) + 2 and P(i+16) = P(i) + 1 for all 0 ≤ i ≤ 7. Longer such chains can be
observed but the details are omitted.
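The table need not be stored explicitly: P is the bit-reversal permutation on 5-bit indices. This is an observation about Table 4 (verified against it below; the thesis itself only tabulates the values):

```python
def P(i):
    """Reverse the 5 bits of i (0 <= i < 32)."""
    result = 0
    for _ in range(5):
        result = (result << 1) | (i & 1)
        i >>= 1
    return result

# The values of Table 4, in index order.
TABLE = [0, 16, 8, 24, 4, 20, 12, 28, 2, 18, 10, 26, 6, 22, 14, 30,
         1, 17, 9, 25, 5, 21, 13, 29, 3, 19, 11, 27, 7, 23, 15, 31]
```

Bit reversal also makes the footnote's observations immediate: the function is its own inverse, and adding 1 to the index (setting the lowest bit) adds 16 (the highest bit) to the value.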
[Figure 9 diagram: The arrays a[0...D−1], b[0...D−1] and c[0...D−1] each pass through
Padding, producing a′[0...K−1], b′[0...K−1] and c′[0...K−1]. Permutation I is applied to
a′ and b′, and Permutation II to c′, producing x[0...K−1], y[0...K−1] and z[0...K−1].
Bit collection combines these into w[0...3K−1], and Bit selection produces e[0...S−1].]
Figure 9: An overview of the entire processing step. For each step it is shown what is its input array and
what output array is produced.
5.3.1.1 Padding
This step is applied to the bit arrays a, b and c individually. The output for them will be
three bit arrays a'[0...K-1], b'[0...K-1] and c'[0...K-1], respectively. It will only be
explained for a, as it does the same thing to all three of them. The first K − D bits of a'
are set to NULL. The array a is then placed into the remaining D bits of a'. Figure 10
shows how the step works for a small example.
Figure 10: The figure demonstrates the step Padding when applied to the array a for D = 246 and N = 62.
This implies that K = 256, and so the first K − D = 10 bits of a' are set to NULL. The tables in the figure
show how a is modified through the step. They show only some parts of the array, and bits that are NULL have
a darker shade.
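A minimal sketch of Padding follows. Since a real bit cannot hold the value NULL, the sketch stores each bit in an int and represents NULL by -1; this convention is an assumption of the illustration, not the thesis implementation.

```c
#include <assert.h>

/* Sketch of the step Padding for one array. Each "bit" is stored in
 * an int, and NULL is represented by -1. D is the length of the
 * input array and K = 32*R is the padded length. */
void padding(const int *a, int D, int K, int *a_padded)
{
    /* The first K - D bits of a' are set to NULL (-1)... */
    for (int i = 0; i < K - D; i++)
        a_padded[i] = -1;
    /* ...and a is placed into the remaining D bits. */
    for (int i = 0; i < D; i++)
        a_padded[K - D + i] = a[i];
}
```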
5.3.1.2 Permutation I
This step is applied to the bit arrays a' and b' individually. It will only be explained for
a', as it does the same thing to both of them. The output for them will be two bit arrays
x[0...K-1] and y[0...K-1], respectively. A matrix of bits Y[0...R-1, 0...31] is used in
this step. It has R rows and 32 columns. Figure 11 illustrates the step for a small
example.
The bits of a' are placed in Y row by row. Next, Y's columns are permuted such that
the column that is column i after the permutation was column P(i) before the permutation.
The output bit array x is then obtained by reading the bits of Y column by column.
Figure 11: The figure demonstrates the step Permutation I when applied to the array a for D = 246 and
N = 62. This implies that R = 8 and K = 256. The tables in the figure show how the matrix Y is modified
through the step. Only some of the matrix's columns are shown. Each table shows how the matrix's bits originate
from the array a. Those cells that only say "NULL" are bits that were inserted during the step Padding and so do
not originate from a.
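A sketch of Permutation I in the same style (NULL bits as -1; an illustration, not the thesis implementation). The R-by-32 matrix is kept implicit: since a' is written row by row, the bit in row r, column j is a_padded[32*r + j], so permuting the columns with P and then reading column by column amounts to indexing column P(j) directly.

```c
#include <assert.h>

static const int P[32] = {
    0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
    1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

/* Sketch of Permutation I: place the K = 32*R bits of a_padded in an
 * R-by-32 matrix row by row, permute the columns with P, and read
 * the result out column by column into x. */
void permutation_1(const int *a_padded, int R, int *x)
{
    int k = 0;
    for (int j = 0; j < 32; j++)      /* output column j ...    */
        for (int r = 0; r < R; r++)   /* ... read top to bottom */
            x[k++] = a_padded[32 * r + P[j]];
}
```

For R = 1 the matrix is a single row, so the output is simply a' permuted by P, which makes the function easy to check by hand.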
5.3.1.3 Permutation II
This step is applied to the bit array c'. The output will be the bit array z[0...K-1]. A
function is used that is defined as follows:

F(i) = (P(⌊i/R⌋) + 32·(i mod R) + 1) mod K, where 0 ≤ i ≤ K − 1

Formula 2: Definition of the function F.
The function P is defined as in Table 4. The bits of c' are moved to z as follows:

z[i] = c'[F(i)], for 0 ≤ i ≤ K − 1
The following figure demonstrates the step for a small example:
Figure 12: The figure demonstrates the result of applying the step Permutation II to the array c for D = 246
and N = 62. This implies that the length of the array is K = 256. Some parts of the resulting array z are
shown. It is shown how its bits originate from the array c. Those array positions that say "NULL" are bits that
were inserted during the step Padding and so do not originate from c.
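Formula 2 translates directly into code. The sketch below uses the same NULL-as-(-1) convention as before and is an illustration only, not the thesis implementation:

```c
#include <assert.h>

static const int P[32] = {
    0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
    1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

/* Formula 2: F(i) = (P(floor(i/R)) + 32*(i mod R) + 1) mod K,
 * with K = 32*R. */
int F(int i, int R)
{
    return (P[i / R] + 32 * (i % R) + 1) % (32 * R);
}

/* Sketch of Permutation II (NULL bits as -1): z[i] = c'[F(i)]. */
void permutation_2(const int *c_padded, int R, int *z)
{
    for (int i = 0; i < 32 * R; i++)
        z[i] = c_padded[F(i, R)];
}
```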
5.3.1.4 Bit collection
This step is applied to the bit arrays x, y and z. The output will be a bit array
w[0...3K-1]. The three arrays' bits are placed into w as follows:

w[i] = x[i], for 0 ≤ i ≤ K − 1
w[K + 2j] = y[j], for 0 ≤ j ≤ K − 1
w[K + 2k + 1] = z[k], for 0 ≤ k ≤ K − 1

Formula 3: How the three arrays' bits are placed in w.
The following figure demonstrates the step for a small example.
Figure 13: The figure demonstrates the result of applying the step Bit collection to the arrays x, y and z for
D = 246 and N = 62. This implies that the length of each of the arrays is K = 256 and so the resulting
array w's length is 3K. Some parts of w are shown. It is shown how its bits originate from the arrays a, b and c.
Some of the bits are NULL bits that were placed into the arrays a', b' and c' during the step Padding. These
have been marked by "NULL_a'", "NULL_b'" and "NULL_c'". Note how Formula 3 places x into the first third
of w, while y and z are interlaced bit by bit and fill the rest of w. This becomes obvious in this example.
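Formula 3 can be sketched as follows (NULL bits as -1; an illustration, not the thesis implementation):

```c
#include <assert.h>

/* Sketch of Bit collection (Formula 3): x fills the first third of w,
 * while y and z are interlaced bit by bit into the remaining 2K
 * positions. w must have room for 3K entries. */
void bit_collection(const int *x, const int *y, const int *z,
                    int K, int *w)
{
    for (int i = 0; i < K; i++) {
        w[i] = x[i];                /* w[i]          = x[i] */
        w[K + 2 * i] = y[i];        /* w[K + 2i]     = y[i] */
        w[K + 2 * i + 1] = z[i];    /* w[K + 2i + 1] = z[i] */
    }
}
```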
5.3.1.5 Bit selection
This step is applied to the bit array w. The output will be a bit array e[0...S-1].
Algorithm 7 shows how e is produced. Figure 14 demonstrates the step for a small
example. Starting from bit w[offset], where offset = R·(24·T + 2), w is traversed circularly
until enough bits have been collected for e. However, bits that are NULL are skipped.
Algorithm 7: The algorithm shows how the bit array e is produced from w.
//The input is a bit array w[0...3K-1],
//a non-negative integer T<=3 and a positive integer S.
//The output is a bit array e[0...S-1].
offset = R*(24*T + 2);
j = 0;
k = 0;
while (k < S) {
if (w[(offset + j)%(3*K)] != NULL) {
e[k] = w[(offset + j)%(3*K)];
k++;
}
j++;
}
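Algorithm 7 can be made directly runnable with the NULL-as-(-1) convention used in the earlier sketches (an illustration, not the thesis implementation; the caller must guarantee that w contains at least one non-NULL bit so that the loop terminates):

```c
#include <assert.h>

/* Runnable version of Algorithm 7. NULL bits are represented by -1.
 * w has 3K entries; e receives S non-NULL bits, read circularly
 * starting from offset = R*(24*T + 2). */
void bit_selection(const int *w, int K, int R, int T, int S, int *e)
{
    int offset = R * (24 * T + 2);
    int j = 0, k = 0;
    while (k < S) {
        int bit = w[(offset + j) % (3 * K)];
        if (bit != -1)              /* NULL bits are skipped */
            e[k++] = bit;
        j++;
    }
}
```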
Figure 14: The figure demonstrates the result of applying the step Bit selection to the array w for D = 246,
N = 62, T = 1 and S = 498. This implies that the length of w is 3K = 768. Also, the processing step's
final output array e has length S = 498. Some parts of e are shown in the figure. It is shown how its bits
originate from the arrays a, b and c. NULL bits are not included in e, so every bit is from either a, b or c. Due to
the value of T, e begins with bits from a. 3(D − N) = 552 of w's bits are not NULL. The traversal starts at
w[offset] and wraps around the end of the array before sufficiently many bits have been read, re-entering the
part of w that holds bits from a. This is why e also ends with bits from a.
5.3.2 Specification of Rate dematching
The input of the processing step is:
A byte array E[0...S-1], where S is specified in the input.
Three byte arrays, namely A[0...D-1], B[0...D-1] and C[0...D-1]. The first
N bytes of each array are NULL. The length of the arrays is constrained by
D ≤ 6148.
A non-negative integer T ≤ 3 (its purpose will be explained shortly).
A boolean CLEAR.
The processing step has no explicit output. Its purpose is to modify A, B and C as will
shortly be described. The bytes of the arrays are soft values. Therefore no byte value can
be spared to indicate that the byte is NULL. But as in Rate matching, the first N bytes of
the three arrays are considered to be NULL. Apart from this, they will be treated as the
other bytes in the processing step.
A function SatFunc will be used in this specification and is defined as follows for any
integer x:

SatFunc(x) = 127,  if x > 127
SatFunc(x) = −128, if x < −128
SatFunc(x) = x,    otherwise

Formula 4: Definition of SatFunc for any integer x.
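SatFunc simply saturates its argument to the signed 8-bit range. A minimal sketch:

```c
#include <assert.h>

/* Formula 4: SatFunc saturates any integer to the signed 8-bit
 * range [-128, 127]. */
int sat_func(int x)
{
    if (x > 127)
        return 127;
    if (x < -128)
        return -128;
    return x;
}
```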
The processing step begins by clearing A, B and C if CLEAR is set. The rest of the
specification relies on Rate matching. Assume Rate matching is performed for three bit
arrays a[0...D-1], b[0...D-1] and c[0...D-1], where the first N bits of each array are
set to NULL. The values of the other bits are irrelevant. The inputs D, N, S and T of Rate
matching are to be the same as they are for Rate dematching. The output of the former
will be a bit array e[0...S-1]. It is important to understand that each of e's bits
originates from a bit of either a, b or c. It is true that e was produced from the array w in
the last step Bit selection, and some of w's bits are NULL bits that were introduced in or
before the step Padding. But NULL bits were not written to e, and apart from these all
the other bits of w originate from either a, b or c. Now, starting from E[0] and moving
forward, the values of E are used to modify A, B and C as Algorithm 8 shows.
Algorithm 8: The algorithm shows how the bytes of E are used to modify A, B and C.
//Input is four byte arrays E[0...S-1], A[0...D-1], B[0...D-1] and C[0...D-1]
//and two non-negative integers N and T <= 3.
//Rate matching is performed first.
e[0...S-1] = Rate_matching (a[0...D-1], b[0...D-1], c[0...D-1], N, T, S);
//The array E is then used to modify the arrays A, B and C as follows.
for (i = 0; i < S; i++) {
if (e[i] originates from a[j]) {
A[j] = SatFunc (A[j] + E[i]);
} else if (e[i] originates from b[k]) {
B[k] = SatFunc (B[k] + E[i]);
} else { //e[i] originates from c[l]
C[l] = SatFunc (C[l] + E[i]);
}
}
To give an example and introduce some terminology, if e[i] originates from a[j] then
E[i] is used to modify A[j] by using SatFunc. It is said that the latter two are soft
combined. Figure 14 illustrates an example of Rate matching, where the input parameters
D, N, T and S have some specific values. The resulting array e is shown. Suppose Rate
dematching is executed with the same values for the four input parameters. Then the same
figure also makes it obvious, for each byte of E, with what byte of A, B or C it is to be soft
combined.
6 Processing algorithm for Channel deinterleaver
This chapter presents the processing algorithm that was used for Channel deinterleaver.
Paragraph 6.1 first makes some relevant observations regarding the processing step.
Paragraph 6.2 suggests how a processing algorithm can perform the processing step. This
includes how the work can be divided among the job DSPs that are to perform the
processing step. The algorithm that was chosen as a base for the implementation is
presented by Paragraph 6.3 and further specified in Appendix 2.
An implementation of the processing algorithm was written during the thesis. For
details regarding the implementation and what steps were taken to verify its correctness
and to improve its performance, refer to Appendix 2.
6.1 Initial discussion
The following subparagraphs discuss the processing step beyond its specification in
Chapter 5. The purpose is to highlight interesting properties that it has before a
processing algorithm is discussed.
6.1.1 Regarding existing literature on matrix transposition
The major task of the processing step is to perform matrix transposition; the symbols
are placed into a matrix column by column and then read row by row (see Paragraph
5.2.2). The symbols are to be permuted from column major order to row major order. This
problem has been widely studied, and there are many algorithms in the literature that are
claimed to be efficient. The problem is that no algorithms suitable for the
processing step's matrix were found. Nor is the underlying hardware architecture that the
literature assumes a good description of EMPA.
To give examples, [3], [6] and [5] provide a good summary of several transposition
algorithms and review their performance. Most of these algorithms are concerned with
square matrices, which is clearly not suitable for the processing step, since the matrix has
up to 120 times more rows than columns. An approach where the non-square matrix is
extended to be square is clearly infeasible.
However, even the algorithms that do work with a non-square matrix still put special
restrictions on its dimensions that are not suitable for the processing step. For instance, one
of the algorithms of [3] first divides the matrix along its columns into multiple new
matrices, then divides each new matrix along its rows, then divides the resulting matrices
along their columns, and so on. This is clearly not applicable to the processing step's
matrix, as it can have many times more rows than columns.
Apart from issues regarding the shape of the matrix, the assumptions that have been
put on the underlying hardware do not match EMPA. Some of the algorithms of [3], [6]
and [5] assume that the matrix in question is so large that it must reside in secondary
memory while parts of it are moved to primary memory for processing. One might be
tempted to regard EMPA's CM as secondary memory and the DSP's LDM as the
primary memory. But even though the processing step's matrix can be so
large that it can not reside in LDM, the memory latencies of CM, which require only a few
cycles, can not be compared to the I/O operations that [3] and [6] assume take multiple
milliseconds.
On a final note, all the algorithms of [3], [6] and [5] are in-place. This is not required by
an implementation of the processing step; besides the input that it receives in CM, it is
allowed to use an equally large portion of the memory to produce the output.
All these issues also occur in literature that suggests how matrix transposition can be
parallelized (see [7]).
6.1.2 The q symbols
Some of the symbols that are placed into the processing step's matrix are to be cleared
(see Paragraph 5.2.2). These symbols will be discussed at length and will therefore be
referred to as q symbols. q symbols are in the matrix's last rows, and there can be at most
four in each row.
The number of q symbols is independent of the number of rows in the matrix, except
that there can be at most four q symbols in each row. It rarely exceeds 100, and is on average
no more than 40.
6.2 Suggestion for processing algorithm
A processing algorithm is suggested in this paragraph. What needs to be discussed is
how the work should be divided among the job DSPs, and how the q symbols can be
cleared.
6.2.1 How to parallelize
Consider a general case where a matrix is to be transposed on EMPA. Initially the
matrix’s elements are written column by column in CM, and they are to be permuted to
be in the order row by row. A part of the memory that is equally large as the input is to be
used to write the output. Every element is a number of complete words large, but a
specific size will not be mentioned to keep the discussion general. Also, the matrix can
have any number of columns and rows. Figure 15 demonstrates this for a small matrix.
Figure 15: The figure shows two ways a matrix M[0...7, 0...7] can appear in CM, namely column by column
and then row by row.
A number of job DSPs are to work together. Each of them will read some of the
matrix's elements from the input in CM to its LDM. When they have been rearranged
they are written back to CM where the output should be.
A DSP will perform a CM instruction to read or write an array of elements from the
memory. The memory latencies CM_read_lat and CM_write_lat apply for initiating such an
instruction (see Paragraph 2.2). Besides these, say that on average cm_cycles cycles
are required to read or write an element of the array. Also, the LDM of every DSP has
available memory for ldm_elements elements.
Each DSP can be assigned a rectangular part of the matrix, covering a number of
columns and rows. Its task is to read the part’s columns from the input, and write the
corresponding rows of the part to the correct positions of the output. The part of the
matrix that has been assigned to a DSP will be referred to as its section. Assume for this
discussion that the matrix can be divided into equally large sections, one for each DSP.
Figure 16 demonstrates the idea for a small matrix. It also highlights what elements one
of the DSPs must read from the input and write to the output.
The benefit of this method is that it is straightforward and no synchronization is
required. Each DSP writes to its own parts of the output, not interfering with the work of
the other DSPs.
Figure 16: The matrix M[0...7, 0...7] is divided into 8 sections. Each section spans over 2 columns and 4
rows. The figure first highlights one of the sections and the elements that it includes. It then shows for that
section what parts of the input must be read from CM, and finally what parts of the output must be written to.
Assume that the section does not fit into the DSP's LDM. One way to process the
section is to divide it into multiple rectangular parts. Such a part will be referred to as a
segment, and it spans over a number of rows and columns.9 The DSP processes one
segment at a time.10 It must read the segment's elements from the input in CM to LDM,
and write them to the output. It is desirable that the execution time spent on performing
these CM instructions is minimized. To achieve this, two things must be considered.
The first is in what manner the CM instructions are performed. The DSP should use
only one instruction to read each of the segment's columns from the input, and only one
instruction to write each of the segment's rows to the output. None of them should be
read or written partially or twice.11 But in order to write one of the segment's rows to the
output, all the elements of that row must be present in LDM. And since no column can be
read partially or twice, all the columns must be present in the memory to produce the row
and the other rows. This requires that the segment is smaller than LDM.
The second thing to consider is the segment's dimensions. Say that its width and height
are nr_cols and nr_rows elements respectively. The number of cycles required to perform
CM instructions to process the segment is:

nr_cols·CM_read_lat + nr_rows·CM_write_lat + 2·nr_cols·nr_rows·cm_cycles

To explain the formula, nr_cols instructions must be initiated to read the entire segment
from CM. nr_rows instructions are initiated to write the segment. Besides initiating
the instructions, the elements must of course be read and written. Now, the expression is
to be minimized under the following restriction:

nr_cols·nr_rows = ldm_elements

The restriction is too strict for practical purposes, as it uses all of the available memory in
LDM for the segment. But it suffices for this discussion. To make further mathematical
arguments more convenient, say that x = nr_rows and y = nr_cols and allow the two
variables to have real values. The following function

f(x, y) = y·CM_read_lat + x·CM_write_lat + 2·x·y·cm_cycles

Formula 5: Definition of the function f for any real values x and y.
9 Figure 16 already illustrates this if one considers the figure’s entire matrix to be a section, and the
highlighted section in the figure to be a segment.
10 The verb “process” will be frequently used to state that some task is being performed. Do not confuse
this with a processing step.
11 This argument may sound a bit too restrictive but it will suffice for the final conclusion of this paragraph.
In the mean time, the reader should recall that Channel deinterleaver’s matrices have a very special
dimension.
is to be minimized under the following constraint

x·y = ldm_elements

Formula 6: The restriction that must be met by f.
Note that f is a continuous function, but it is constrained to an endless curve even if only
positive values for x and y are considered.
The method of Lagrange multipliers states that if f has any minimal turning point (x, y)
while subject to the constraint, then there is a v such that (x, y, v) solves the following set
of equations:

CM_write_lat + 2·y·cm_cycles − v·y = 0
CM_read_lat + 2·x·cm_cycles − v·x = 0
x·y − ldm_elements = 0

Formula 7: The equations that any minimal turning point of f solves.
However, the converse implication does not have to be true. A triplet (x, y, v) that solves
these equations does not necessarily yield an extremum (x, y) of f under the
constraint. Thus, before solving the equations it must first be proved that there is at least
one minimal extremum. For this purpose the following evaluations of f are made
under the constraint of Formula 6:

If f is subject to the constraint and x → ∞, then f → ∞.
If f is subject to the constraint and y → ∞, then f → ∞.

Formula 8: Evaluations made of f under the constraint of Formula 6.
f grows to infinity as x or y grows to infinity along the curve in the first quadrant. Yet f is
a continuous function. This proves that f must have at least one minimal extremum in the first
quadrant when subject to the constraint. To find one, the three equations can be solved.
One solution is:

y = sqrt(ldm_elements · CM_write_lat / CM_read_lat)

Only positive values for y are of interest. For the purpose of the processing step, consider
the largest symbol size, which is six bytes. For this symbol size ldm_elements should be

ldm_elements = LDM_avail / 6
Given this, the solution suggests that a segment should span over y ≈ 31 columns and
x ≈ 65 rows. These numbers increase for smaller symbols.
The obvious problem is that the matrix of the processing step can not have that many
columns. In such case a segment should instead cover as many consecutive rows of the
matrix as LDM allows. A section too should cover a consecutive number of rows. Figure
17 demonstrates the idea for a small example. This approach is beneficial, because now
all the segment’s rows are to be written directly after one another in the output, and this
can be done using a single CM instruction.
Figure 17: A matrix M of the processing step with 12 columns and 36 rows. The symbol size is irrelevant. It is
first shown how it is divided into 3 sections, and a segment in the second section is highlighted. It is then shown
what parts of the input must be read to process the segment, and then what part of the output must be written to.
Recall that the input appears in CM column by column, but it is to be written row by row. 12 CM instructions are
required to read from the input, while only one is required to write to the output.
Note that this discussion only considers how much of an implementation's execution
time is spent on performing CM instructions. How the implementation otherwise
works was disregarded. But its purpose was to show that any implementation of the
processing step should give each job DSP a set of consecutive rows to process. Otherwise
an unacceptable amount of the execution time is spent on performing many CM
instructions, no matter how efficient the implementation is otherwise.
Dividing the processing step’s matrix into sections and segments in this way was
chosen as a base for an algorithm.
6.2.2 Clearing the q symbols
By dividing the matrix's rows into equally large sections, the workload may be
unevenly divided among the job DSPs, since only the sections containing the last rows
have q symbols that need to be cleared. This will hardly be noticeable if each section spans
over, say, 400 rows and there are only 25 rows with q symbols in the last section. And
even if the sections are smaller, an even workload may be maintained by not making the
sections equally large. Somehow a section that contains q symbols can differ in size from
other sections.
How they should differ depends entirely on the implementation. It might actually be
beneficial for a DSP to be assigned a section containing q symbols, since it does not have
to read those symbols from the input. For a column of the segment that contains q
symbols in the end, the implementation can read from the input in CM only the first part
of the column that does not contain any q symbols. It then clears the last part of the
column. Figure 18 illustrates the idea.
Figure 18: A matrix of the processing step is shown from top to bottom, but some rows have been skipped. q
symbols are marked with a “q” throughout the matrix. A segment that spans over 8 rows is shown in shaded
colors. The parts of the columns that must be read from the input in CM appear in a darker shade, while q
symbols that do not need to be read have a lighter shade.
On the other hand, each of the segment's columns may be read completely to LDM and
then the q symbols can be cleared (before the segment is written to the output in CM).
The latter strategy was chosen for an initial implementation, but it can be switched later.
Also, sections are to be equally large. If it turns out that q symbols cause too uneven a
workload balance, then the method of choosing section size can be changed to
compensate for this.
6.3 Description of the processing algorithm
The processing step's matrix is divided into equally large sections, one for each job
DSP. Each section is divided into many segments. How a segment's size is determined
will be described shortly. Each of the segment's columns is then read from the input in
CM to a column buffer in LDM. Each of the segment column's symbols is then placed
from the column buffer into the segment in LDM. Once all of the segment columns have
been placed into the segment in this manner, all of its q symbols are cleared. The segment
is then written to CM using only one instruction. The next segment can now be
processed. Figure 19 demonstrates the procedure for a small example.
Figure 19: The figure illustrates the processing of one segment. It shows how data is moved between CM and
LDM. Specifically, the last column of the segment is read from the input to the column buffer in LDM using one
CM instruction. The symbols are then moved from the buffer to the segment by multiple LDM instructions.
Then, q symbols are cleared using such instructions too. Finally, the segment is written to the output in CM by
one CM instruction.
So in LDM there must be room for the entire segment and the column buffer. Say that
the matrix has nr_cols columns and a symbol is symbol_size bytes large (Paragraph 5.2.2
shows the variable's possible values). LDM has LDM_avail bytes available (see
Paragraph 2.1.1). In that case the segment can span over nr_rows rows, where nr_rows
solves the following equation:

(nr_cols·nr_rows + nr_rows)·symbol_size = LDM_avail

Formula 9: The equation is used to determine the size of a segment.
The benefit of this approach is that the segments are as large as possible. This
minimizes the number of CM instructions that must be performed. A possible
disadvantage is that, due to the capabilities of a DSP, the implementation can perhaps be
improved by processing say two of the segment's columns at a time, which would require
a larger column buffer. However, this can be considered for later optimizations.
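Formula 9 can be solved for nr_rows with integer arithmetic. A minimal sketch (variable names follow the text; this is an illustration, not the thesis implementation):

```c
#include <assert.h>

/* Largest nr_rows satisfying Formula 9:
 * (nr_cols*nr_rows + nr_rows) * symbol_size <= LDM_avail,
 * i.e. the segment plus one column buffer of nr_rows symbols must
 * fit in the available LDM. */
int segment_rows(int nr_cols, int symbol_size, int ldm_avail)
{
    return ldm_avail / ((nr_cols + 1) * symbol_size);
}
```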
7 Processing algorithm for Rate dematching
This chapter presents the processing algorithm that was used for Rate dematching.
Paragraph 7.1 makes some initial observations regarding the processing step that are
relevant. Paragraph 7.2 suggests how a processing algorithm for Rate dematching can
work. The processing algorithm that was used as a base for the implementation is
presented by Paragraph 7.3 and further specified in Appendix 3.
An implementation of the processing algorithm was written during the thesis. For
details regarding the implementation and what steps were taken to verify its correctness
and to improve its performance, refer to Appendix 3.
7.1 Initial discussion
This paragraph discusses the processing step beyond its specification. The purpose is to
give a new idea of how it works that makes it easier to understand how an efficient
algorithm can be written. This will be done from Paragraph 7.1.4 and forward. But its
specification relies greatly on Rate matching, and because of this it will first be presented
how the latter processing step works from a new perspective. This will be done in
paragraphs 7.1.1 to 7.1.3. A good understanding of the processing steps' specifications is
recommended (see Paragraph 5.3).
7.1.1 Redefinition of Permutation II
Recall the function F(i) that is used to rearrange the bits of the array c' to z in
Permutation II (see Paragraph 5.3.1.3). Suppose c' is placed in a matrix
Y[0...R-1, 0...31] row by row just as Permutation I would. This enables one to see that
the function F permutes the bits of c' to z in a way very similar to how Permutation I
would do it.
Suppose the function

G(i) = ⌊i/R⌋ + 32·(i mod R)

is used in Permutation II instead of F. This equals placing the bits of c' in Y row by
row; z is then obtained by reading the matrix's bits column by column. To see this, note
how ⌊i/R⌋ chooses the current column while i mod R chooses the row.
Now, suppose the function

H(i) = P(⌊i/R⌋) + 32·(i mod R)
is used instead of F (recall the function P's definition from Paragraph 5.3.1). This would
make Permutation I and II equal. To see this, compare G(i) to H(i) for the same i in the
context of Permutation II. The former function assigns the bit in column ⌊i/R⌋ and row
i mod R of Y to z[i]. The latter function chooses the bit in the same row, but instead in
column P(⌊i/R⌋). So G produces the output by reading the matrix's bits column by
column, while H first permutes the columns by using P. This is exactly what Permutation I does.
A crucial observation can be made at this stage, namely

F(i) = (H(i) + 1) mod K

Recall that K = 32·R and that this is the number of bits in Y. The similarity of
Permutation I and II will be established in the following argument. The reader should
remember that using H instead of F in Permutation II is equal to Permutation I. Also keep
in mind that c' has been placed in Y row by row.
Suppose c'[H(i)] is a bit located in any of Y's columns but the rightmost. Then
c'[H(i) + 1] is in the same row but in the column to the right of it.
Suppose c'[H(i)] is in the rightmost column but not the bottom row. Then
c'[H(i) + 1] is the leftmost bit of the next row.
Finally, suppose c'[H(i)] is the rightmost bit of the bottom row. Then
c'[(H(i) + 1) mod K] is the leftmost bit of the top row.
Permutation II can now be defined in a more convenient way that reveals how similar it
is to Permutation I.
Redefinition of Permutation II: place the bits of c' in the matrix Y row by row. Each
row is filled from left to right. In the leftmost column, move each cell one row up. Move
the cell that was in the top row to the bottom of the column. Now, rearrange the columns
such that the column that is column i after the rearrangement was column Q(i) before the
rearrangement. Q is defined as follows:

Q(i) = (P(i) + 1) mod 32

Formula 10: Definition of the function Q.
The bit array z is produced by reading the matrix's bits column by column. Each
column is read from top to bottom. Figure 20 demonstrates the step for the same small
example that was followed throughout Paragraph 5.3.1 and its subparagraphs. Compare it
to Figure 11, which illustrates Permutation I.
This is the new definition of Permutation II from here on. The reader should consult
this one when the permutation is mentioned in the rest of this chapter. The old
definition should only be read for clarifying this one. From here on in the report,
when it is said that the matrix's columns are permuted as Permutation II would, then this
includes two things. Namely (i) how the cells of column zero are shifted up and then (ii)
the rearrangement of the order of the columns.
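The equivalence between the redefinition and the original definition by F can be checked mechanically. The following sketch builds z both ways for c' = 0, 1, ..., K−1 and compares the results (an illustration only; R is limited to 8 here to keep the matrix buffer small):

```c
#include <assert.h>
#include <string.h>

static const int P[32] = {
    0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
    1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

/* z built directly from F, as in Paragraph 5.3.1.3. */
static void via_F(const int *c, int R, int *z)
{
    int K = 32 * R;
    for (int i = 0; i < K; i++)
        z[i] = c[(P[i / R] + 32 * (i % R) + 1) % K];
}

/* z built from the redefinition: fill Y row by row, shift column 0
 * up one step (top wraps to bottom), then read the columns in the
 * order given by Q(i) = (P(i) + 1) % 32. Requires R <= 8. */
static void via_redefinition(const int *c, int R, int *z)
{
    int Y[8][32], k = 0;
    for (int r = 0; r < R; r++)
        for (int j = 0; j < 32; j++)
            Y[r][j] = c[32 * r + j];
    /* shift the leftmost column up by one row */
    int top = Y[0][0];
    for (int r = 0; r < R - 1; r++)
        Y[r][0] = Y[r + 1][0];
    Y[R - 1][0] = top;
    /* new column i is old column Q(i); read column by column */
    for (int j = 0; j < 32; j++)
        for (int r = 0; r < R; r++)
            z[k++] = Y[r][(P[j] + 1) % 32];
}

/* Returns 1 if the two constructions agree for this R (R <= 8). */
int redefinition_matches(int R)
{
    int c[256], z1[256], z2[256], K = 32 * R;
    for (int i = 0; i < K; i++)
        c[i] = i;
    via_F(c, R, z1);
    via_redefinition(c, R, z2);
    return memcmp(z1, z2, K * sizeof(int)) == 0;
}
```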
Figure 20: The figure demonstrates the step Permutation II when applied to the array c for D = 246 and
N = 62.12 This implies that R = 8 and K = 256. The tables in the figure show how the matrix Y is modified
through the step. Only some of the matrix's columns are shown. Each table shows how the matrix's bits originate
from the array c. Those cells that only say "NULL" are bits that were inserted during the step Padding and so do
not originate from c.
12 Note that this is the same as the example that was followed throughout paragraph 5.3.1 and its
subparagraphs.
7.1.2 Bit collection seen from a new perspective
Before discussing the step Bit collection in Rate matching, one should recall from
where it obtains its input. Its input is the bit arrays x[0...K-1], y[0...K-1] and
z[0...K-1]. The two former were produced by Permutation I, while the last was
produced by Permutation II (see paragraphs 5.3.1.2 and 7.1.1). Each permutation
produced its output by reading the bits of its matrix column by column. The matrix's
columns had been permuted before that. Say that Permutation I used the matrices
X[0...R-1, 0...31] and Y[0...R-1, 0...31] while Permutation II used Z[0...R-1, 0...31].
Assume that the matrices' columns have already been permuted as the two permutations
do.
Now, in the step Bit collection the bit array x is placed in the beginning of the bit array
w (see Paragraph 5.3.1.4). This equals reading matrix X's bits column by column and
placing them in the beginning of w.
Next in the step, the arrays y and z are interlaced bit by bit in the order
y[0], z[0], y[1], z[1], .... The bits are then placed in the remaining part of w. Interlacing the
arrays y and z equals interlacing the bits of Y and Z column by column. The order of
the bits is

Y[0,0], Z[0,0], Y[1,0], Z[1,0], ..., Y[R-1,0], Z[R-1,0], Y[0,1], Z[0,1], ...

Formula 11: The order in which the bits of Y and Z are placed in w.
The bits are then placed in the remaining part of w instead of y and z. w is now full. The
important observation is that the matrices appear in w column by column, only that X's
columns come first and then the columns of Y and Z come interlaced. Figure 21 shows
how the columns of X, Y and Z appear in w for a small example.
Figure 21: The figure shows how the columns of the matrices X, Y and Z appear in w. The columns of Y and Z
have been interlaced bit by bit.
7.1.3 Bit selection seen from a new perspective
In the step Bit selection in Rate demtaching, the array w is traversed circularly until
sufficiently many non NULL bits have been collected for the output array e (see
64
Paragraph 5.3.1.5). The previous paragraph showed how the bits of the matrices X, Y and
Z were placed in w. The purpose of this paragraph is to show how traversing w is
equivalent to traversing the matrices. Recall that their columns have been permuted by
Permutation I and II.
Imagine that Bit selection traverses w from its beginning.13
The previous paragraph
shows that this is equivalent to first traversing X column by column. When the bottom bit of its
rightmost column has been reached, Y and Z are traversed in the order specified by
Formula 11. This means the matrices are traversed column by column, but first one of Y’s
bits, then one of Z’s bits, then one of Y’s bits and so on. When the last bit in this order has
been reached w has been fully traversed. In such case it is again traversed from its
beginning as described. This continues until sufficiently many non NULL bits have been
collected from the three matrices for the output array e.
However, w is not traversed from its beginning. The starting position is bit
offset = 2R(12T + 1)
where T is a non-negative integer. Note that offset is thus always an even multiple of R.
This means that beginning the traversal of w from bit #offset is equivalent to skipping a number of
complete matrix columns, but only in the first traversal of w. Specifically:
If offset ≤ 32R then offset / R of X's columns are skipped.
If offset > 32R then X is completely skipped and (offset − 32R) / (2R) columns of each
of the matrices Y and Z are skipped.
To clarify with an example, if offset = 46R then all of X's columns and the first 7
columns of Y and Z respectively are skipped in the first traversal of w. In the beginning of
the next traversal, one begins from the first column of X again.
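The case analysis above can be summarized in a small helper; the function name is hypothetical, and offset is assumed to be an even multiple of R, as stated.

```c
/* Sketch (hypothetical name): the number of complete columns skipped
 * in the FIRST traversal of w, given offset, which is assumed to be
 * an even multiple of R. x_cols is the number of X columns skipped;
 * yz_cols is the number of columns skipped in each of Y and Z. */
void skipped_columns(unsigned offset, unsigned R,
                     unsigned *x_cols, unsigned *yz_cols)
{
    if (offset <= 32 * R) {
        *x_cols  = offset / R;
        *yz_cols = 0;
    } else {
        *x_cols  = 32;                           /* X skipped entirely */
        *yz_cols = (offset - 32 * R) / (2 * R);
    }
}
```

For offset = 46R the sketch reports that all 32 of X's columns and 7 columns of each of Y and Z are skipped, matching the example above.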
Some crucial remarks will now be made regarding the manner in which the matrices'
columns appear in e. Figure 22 illustrates the observations for a small
example. The reader should consult the figure frequently while reading on.
First, one must know where the NULL bits appear in the matrices. There are equally
many in each matrix and it is possible that there are none at all. Assume that there are
some and take X as an example. Consider it before its columns were permuted. By
reading its bits row by row, all NULL bits appear first in this ordering. This means that in
any column only the topmost bits can be NULL, and the number of NULL bits in any two
columns can differ only by one. Now, Permutation I permutes the columns of X and Y. It
only switches place between the columns. So still the same observations can be made: in
13 This is actually not possible because the variable offset > 0 decides the starting position, but this will
be taken into consideration later.
any column only the topmost bits can be NULL, and the numbers of NULL bits in any two
columns can differ only by one. Finally, Permutation II is applied to Z. It first rearranges
the bits of column 0, and then switches place between the columns. After the permutation,
column 31 is the column that was originally column 0. Thus the following observations can be made: the numbers of
NULL bits in any two columns can differ only by one. In any column only the topmost bits can
be NULL. The bottommost bit of column 31 is NULL. No other bit anywhere in the matrix
can be NULL.
Now it can be discussed how the matrices’ bits appear in e. Remember that all of them
have been permuted, and how some columns may be skipped due to the variable offset.
Starting with X, its bits appear in e column by column. Each column appears from top to
bottom but the top NULL bits have been skipped.
Continuing with Y, its bits appear in e column by column, but in a special manner.
Consider any column i. Say that it has n NULL bits at the top. The column's remaining
R − n bits appear from top to bottom in e, only that they have been interlaced bit by bit with
some of Z's bits. Specifically, the bits of Y's column appear in e in the following order:
Y(n,i), one Z bit, Y(n+1,i), one Z bit, …, Y(R−2,i), one Z bit, Y(R−1,i)
Exactly which of Z's bits appear in the ordering will not be specified. It is noteworthy
though that they all belong to column i in Z, with one possible exception: the last one
may come from column i + 1.
Finally, how Z's bits appear in e will be shown. Consider any column i ≠ 31. Say that it
has n NULL bits at the top. The remaining bits appear in e in a fashion very similar to the
previous ordering presented, namely
Z(n,i), one Y bit, Z(n+1,i), one Y bit, …, Z(R−2,i), one Y bit, Z(R−1,i)
All the Y bits belong to column i in Y, with one possible exception: the first one may
come from column i − 1. The ordering is very similar for column 31 of Z. If the column
has no NULL bits then the ordering begins with Z(0,31) and ends with Z(R−1,31). If there
are n NULL bits then the ordering begins with Z(n−1,31) and ends with Z(R−2,31).
Finally, note that e may end abruptly after any bit, no matter which
matrix it belongs to. Also, one might be tempted to believe that all of the bits of Y and Z
are interlaced in e. There is however an exception, but only if the matrices have NULL
bits. Each time the two matrices are traversed, two Z bits will appear side by side in e.
The first is the bottommost bit of one column, and the second is the topmost non NULL
bit of the next column.
Figure 22: A small example of Rate matching is illustrated. The purpose of this figure is to show in what manner
the columns of X, Y and Z appear in e. Because of this, for every bit it is only shown which matrix and column it
belongs to. The matrices' columns have already been permuted. This is visible in how the NULL bits have been
rearranged in every matrix. The example's input parameters are D = 143, N = 36, T = 0 and S = 411.
This implies that every matrix shall have R = 5 rows and a total of K = 32R = 160 bits. K − D = 17 bits
of each matrix are NULL bits that were introduced in the step Padding. So each matrix has a total of 17 + N = 53
NULL bits, which leaves D − N = 107 non NULL bits. Note that 3 · 107 = 321 < S, so e traverses all of
the matrices once and then some of X. Five parts of e are shown; note that in the middle part two bits from Z
are adjacent.
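The caption's arithmetic can be checked with a few one-line helpers. The relation R = ⌈D/32⌉ is an assumption made for this sketch; it is consistent with the values quoted (R = 5 for D = 143).

```c
/* Checking the arithmetic of the example in Figure 22 (D = 143,
 * N = 36). The relation R = ceil(D / 32) is an assumption made for
 * this sketch; it is consistent with the values quoted in the caption. */
unsigned rows(unsigned D)                     { return (D + 31) / 32; }      /* R */
unsigned bits(unsigned D)                     { return 32 * rows(D); }       /* K */
unsigned null_bits(unsigned D, unsigned N)    { return bits(D) - D + N; }
unsigned nonnull_bits(unsigned D, unsigned N) { return D - N; }
```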
7.1.4 Rate dematching seen from a new perspective
In Rate dematching the bytes of the array E are used to modify the byte arrays A, B and
C. The specification relies greatly on Rate matching (see Paragraph 5.3.2). Specifically, as
shown by Algorithm 8, E(i) is soft combined with A(j) (or B(k) or C(l)) if e(i)
originates from a(j) (or b(k) or c(l)). But so far in this chapter it has been shown in
what manner the bits of a, b and c appear in e. A short summary follows.
First, the step Padding inserts some NULL bits in the beginning of each of a, b and c.
Then, the steps Permutation I and II place the arrays into the matrices X, Y and Z row by
row, and permute their columns. From there, it was shown how the matrices’ bits appear
in e column by column.
The same idea will now be applied to A, B and C but an obvious redefinition of the
steps of Rate matching is required. Padding, Permutation I and II actually work with bit
arrays, but suppose they would work in the same way with byte arrays. To clarify with an
example, if Padding places K − D NULL bits in the beginning of a bit array then it
places equally many NULL bytes in the beginning of a byte array.14
Now, suppose the steps are applied to A, B and C. Specifically Padding is applied to the
three arrays individually. Then Permutation I places A and B into the matrices
U(0…R−1, 0…31) and V(0…R−1, 0…31) respectively and permutes their columns.
Permutation II places C into W(0…R−1, 0…31) and permutes its columns.
Now, Algorithm 8 can be rephrased by saying that if e(i) originates from X(j,k) then
E(i) is to be soft combined with U(j,k). Y and Z are in the same way related to V and W
respectively. But it has been shown in the previous paragraphs in what manner the bits of
X, Y and Z appear in e column by column. In the same manner, the bytes of E are to be
soft combined with the bytes of U, V and W column by column.
This technique will be used extensively and therefore calls for some terminology. If e is
long enough then X can be traversed multiple times while producing e. One says that
there are multiple repetitions of X in e. In the same manner U is to be soft combined with
multiple repetitions in E. One also says that a column repetition of U is read from E and
soft combined with a column of the matrix. It is therefore important to be mindful of the
difference between a column repetition and a column. The former is in E, while the latter
is in a matrix. The word “column” will never be used as a substitution for “column
repetition”. Also a column repetition is to be soft combined with only the non NULL
bytes of its corresponding column.
But it would be inefficient to actually permute the columns of U, V and W. Instead the
bytes of A, B and C are placed into the matrices as described but the columns are not
14 One cannot argue that confusion arises since a bit array can be considered to be a byte array. The two
types of arrays are different. This is due to the strict definition of an array presented for the purpose of this
report in paragraph 1.3.4.2.
permuted. Now take U as an example and recall that Permutation I would use the function P
to permute its columns. When reading one of U's column repetitions from E, the function
P can be used to calculate which one of U's columns it should be soft combined
with.
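As a sketch, the lookup can use a 32-entry table for P. The table below is the inter-column permutation pattern of the LTE sub-block interleaver (3GPP TS 36.212), assumed here to be the function P used by Permutation I; its values agree with those quoted in the example that follows (P(2) = 8, P(3) = 24, P(30) = 15, P(31) = 31). The function name is hypothetical.

```c
/* The 32-column permutation pattern of the LTE sub-block interleaver
 * (3GPP TS 36.212), assumed here to be the function P used by
 * Permutation I. Because U's columns are NOT physically permuted, the
 * i-th column repetition read from E is soft combined with column
 * P(i mod 32) of U. The function name is hypothetical. */
static const unsigned char P[32] = {
     0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
     1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

unsigned target_column(unsigned i)   /* i = index of a column repetition */
{
    return P[i % 32];
}
```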
The argument presented has been lengthy and based on other arguments from previous
paragraphs. This calls for a specific example that clarifies the final idea. Figure 22 will be
used for this purpose. Remember that Permutation I has used the function P to permute
the columns of X and Y (see Paragraph 5.3.1.2). Permutation II has used the function Q to
permute the columns of Z (see Paragraph 7.1.1). Suppose Rate dematching is to be
performed for three byte arrays A, B, C and E, and that the input parameters D, N, T and S
have the same values as in the figure’s example. What values the arrays contain is
irrelevant for the purpose of this example. A, B and C are used to form the matrices U, V
and W as previously stated. However their columns are not permuted.
Now, pay attention to the bit array e in the figure. Due to the value of T, the variable
offset equals 2R = 2 · 5 = 10. So the first repetition in e belongs to X but it is only partial
because it skips the matrix's first two columns. So the following can be said of the first
32 − 2 = 30 column repetitions in e:
The first column repetition belongs to column 2 of X.
The next column repetition belongs to column 3 of X.
The next column repetition belongs to column 4 of X.
…
The next column repetition belongs to column 30 of X.
The final column repetition of the repetition belongs to column 31 of X.
Because of this, the following can be said of the first 32 − 2 = 30 column repetitions in
E.
The first column repetition is to be soft combined with column P(2) = 8 of U.
The next column repetition is to be soft combined with column P(3) = 24 of U.
…
The next column repetition is to be soft combined with column P(30) = 15 of U.
The final column repetition of the repetition is to be soft combined with column
P(31) = 31 of U.
Next in e there comes a bit sequence that contains a repetition of Y, but it also contains a
repetition of Z. Y’s repetition will be regarded first. It is complete, so it contains 32
column repetitions that belong to Y. Each of them has been interlaced bit by bit with a
part of Z's repetition. Besides this, the following obvious observation can be made:
The first column repetition belongs to column 0 of Y.
The next column repetition belongs to column 1 of Y.
…
The final column repetition of the repetition belongs to column 31 of Y.
Therefore it can be deduced that next in E there comes a byte sequence that contains a
repetition of V and W. Beginning by regarding V’s repetition, it is also complete with 32
column repetitions. Each of them has been interlaced byte by byte with a part of W’s
repetition. So if iE and jE are the first and last bytes of one of V’s column
repetitions, then the subarray jiE contains the entire column repetition but only
every second byte is to be used for soft combination. Keeping this in mind, soft
combination can proceed as follows:
The first column repetition is to be soft combined with column P(0) = 0 of V.
The next column repetition is to be soft combined with column P(1) = 16 of V.
…
The final column repetition of the repetition is to be soft combined with column
P(31) = 31 of V.
Likewise in the same bit sequence of e there is also a complete repetition of Z. Its
column repetitions appear in increasing order of column index too, and each one has been
interlaced bit by bit with a part of Y’s repetition. Therefore in the same byte sequence of
E there is a complete repetition of W, and its column repetitions have been interlaced byte
by byte with a part of V's repetition. Keeping this in mind, soft combination can proceed
as follows:
The first column repetition is to be soft combined with column Q(0) = 1 of W.
The next column repetition is to be soft combined with column Q(1) = 17 of W.
The next column repetition is to be soft combined with column Q(2) = 9 of W.
…
The next column repetition is to be soft combined with column Q(30) = 16 of W.
The final column repetition of the repetition is to be soft combined with column
Q(31) = 0 of W.
Finally, in e there is a partial repetition of X. It covers the first 29 column repetitions, but
the last one is partial. These are to be soft combined with the columns
P(0) = 0, P(1) = 16, …, P(28) = 7 and P(29) = 23
of U, but only some of the topmost non NULL bytes of column 23 are soft
combined with.
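As a side note, the values of Q used above are consistent with the relation Q(i) = (P(i) + 1) mod 32, where P is Permutation I's column permutation. This relation is an observation made here, not a definition taken from the specification; the sketch below merely demonstrates it against the values quoted.

```c
/* Observation: the column order used for W above matches
 * Q(i) = (P(i) + 1) mod 32, where P is Permutation I's column
 * permutation (the LTE sub-block interleaver pattern). */
static const unsigned char P_pattern[32] = {
     0, 16,  8, 24,  4, 20, 12, 28,  2, 18, 10, 26,  6, 22, 14, 30,
     1, 17,  9, 25,  5, 21, 13, 29,  3, 19, 11, 27,  7, 23, 15, 31
};

unsigned Q(unsigned i)
{
    return (P_pattern[i] + 1u) % 32u;
}
```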
In all coming discussion the matrices U, V and W effectively replace A, B and C. The
reader should remember how the former can be produced from the latter, and that the
matrices' columns are not permuted. Each matrix has 32 columns and up to
⌈D/32⌉ = 193 rows.
7.1.5 Remarks regarding SatFunc and soft combining
The binary function SatFunc is used to perform soft combining. It is commutative but
not associative. If a series of bytes are to be soft combined, then this must be done in the
order that they appear.
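SatFunc's exact definition is not repeated in this chapter. As a hypothetical stand-in with exactly the stated properties, saturating signed addition can be used: it is commutative but not associative.

```c
#include <limits.h>

/* Hypothetical stand-in for SatFunc: saturating addition of signed
 * bytes. It is commutative, sat(a, b) == sat(b, a), but not
 * associative near the saturation limits: sat(sat(100, 100), -100)
 * gives 27 while sat(100, sat(100, -100)) gives 100. Hence bytes
 * must be combined in the order that they appear. */
signed char sat(signed char a, signed char b)
{
    int s = (int)a + (int)b;
    if (s > SCHAR_MAX) s = SCHAR_MAX;
    if (s < SCHAR_MIN) s = SCHAR_MIN;
    return (signed char)s;
}
```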
7.1.6 Typical input parameters for Rate dematching
Typical input parameters of Rate dematching are important to know before a processing
algorithm is discussed. The matrices are to be cleared if Rate dematching’s input
parameter CLEAR indicates this. This happens roughly nine out of ten times. The input
parameter T is hardly ever anything else but zero ([2], Table 8.6.1-1). This means that
mostly only the two first column repetitions of U are skipped in E.
The bit array w in Rate matching contains 3(D − N) non NULL bits, and so it is
traversed S / (3(D − N)) times when producing e. The same ratio multiplied by 32 tells how
many column repetitions of a matrix can be found in E, although some of them could be
in partial repetitions of the matrix. So, to clarify, if S / (3(D − N)) = 1 and T = 0 then there
are exactly 32 column repetitions in E belonging to U. Thirty of them appear first in E in
a partial repetition of U, while the last two appear last in E in another partial repetition.
The value of this ratio can vary greatly, making it difficult to predict the number of
column repetitions that must be processed for each matrix. Its value is typically within
0.5 to 1, but it can be as low as 0.38 or as high as 6 ([2], Table 7.1.7.2.1-1). However
higher values than 1.1 rarely occur and do so only for small matrices and E. This means
that it hardly ever happens that a column of a matrix is to be soft combined with more
than one column repetition.
7.2 Suggestion for processing algorithm
A processing algorithm for Rate dematching is suggested in this paragraph. First some
interesting observations are made that can be used to write an efficient algorithm. Then
different strategies for dividing the work among the job DSPs are discussed.
The algorithm will use the matrices U, V and W and place the byte arrays A, B and C
into them. Each array is initially located in CM when the processing step is to be
performed. K − D NULL bytes must first be inserted at the array's beginning, but this
happens to be an even number of bytes. Suppose a DSP is to process A. For this it can
reserve an array of K/2 words in its LDM. The first (K − D)/2 words are considered to
be NULL and A is read from CM into the rest of the words.
7.2.1 The benefit of clearing U, V and W
If each matrix is to be cleared due to the value of CLEAR, then once this has been done
its 32 first column repetitions in E can be directly inserted into it without performing soft
combination. This is because of the simple observation SatFunc(0, i) = i for any byte i.
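A sketch of how the observation can be exploited follows; all names are hypothetical, and sat() stands in for SatFunc (saturating signed addition is assumed).

```c
#include <limits.h>

/* sat() is a hypothetical stand-in for SatFunc (saturating signed
 * addition assumed). */
signed char sat(signed char a, signed char b)
{
    int s = (int)a + (int)b;
    if (s > SCHAR_MAX) s = SCHAR_MAX;
    if (s < SCHAR_MIN) s = SCHAR_MIN;
    return (signed char)s;
}

/* When the matrix has just been cleared, a column of its first
 * repetition in E can be copied straight in, because
 * SatFunc(0, i) = i for any byte i. All names are hypothetical. */
void combine_column(signed char *col, const signed char *rep,
                    unsigned n, int first_after_clear)
{
    unsigned r;
    for (r = 0; r < n; ++r)
        col[r] = first_after_clear ? rep[r] : sat(col[r], rep[r]);
}
```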
7.2.2 Efficient soft-combining of multiple matrix repetitions
Suppose that the first column repetition of U's column t in E is the subarray E(i…j).
The column is to be soft combined with it, but then it must also be soft combined with
E(i + 3(D − N) … j + 3(D − N)), E(i + 6(D − N) … j + 6(D − N)),
E(i + 9(D − N) … j + 9(D − N)), E(i + 12(D − N) … j + 12(D − N)), …
and so on until the end of the array. This is because there can be multiple repetitions of
U and between them there are repetitions of V and W. Then all the mentioned subarrays
of E can be read from CM to LDM. The benefit of this is that if U(k) is in column t and it
is to be soft combined with E(l), then it must also be soft combined with
E(l + 3(D − N)), E(l + 6(D − N)), E(l + 9(D − N)), E(l + 12(D − N)), …
but these bytes of E have already been read to LDM. Therefore U(k) can be read from
LDM, soft combined with them all, and then be written back to LDM. Compare this to
being forced to read and write the byte to the memory once per soft combination.
The method described needs to be elaborated a bit further before it is practically usable.
However, this will not be specified because it would still be of no use due to what
is stated in Paragraph 7.1.6. It rarely happens that any matrix has multiple repetitions in
E, and even then the matrices and E are very small. It is not feasible to take this into
consideration for an algorithm.
7.2.3 Working with bytes in words
The processing step works with bytes, but a byte cannot be addressed in CM (see
Paragraph 2.2). So when an array of bytes in CM is to be read to LDM, then the array
must first be expanded so that it encompasses complete words if this is not the case. Thus
when a CM instruction is executed that reads the resulting word array from CM to LDM,
it might be so that up to two bytes at the ends of the array are of no interest. This problem
does not make it more difficult to write an algorithm, but it makes the process of
implementing it more error-prone. It should therefore be taken into consideration when
specifying an algorithm.
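Assuming two bytes per CM word (consistent with the K/2-word LDM buffer mentioned in Paragraph 7.2), expanding a byte range to whole words can be sketched as follows; the function name is hypothetical.

```c
/* Sketch (hypothetical name): expand a byte range [first_byte,
 * last_byte] in CM to whole words, assuming two bytes per word. Up to
 * one byte at each end of the expanded range is of no interest and
 * must be ignored by the caller. */
void expand_to_words(unsigned first_byte, unsigned last_byte,
                     unsigned *first_word, unsigned *word_count)
{
    *first_word = first_byte / 2;
    *word_count = last_byte / 2 - *first_word + 1;
}
```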
7.2.4 How to parallelize
Two strategies were considered for dividing the workload among the job DSPs. They
differ in what parts of the matrices a DSP is designated to process. This includes clearing
those parts if necessary and then to soft combine them with bytes of E. In the first
strategy, each DSP is designated all the rows in one of the matrices. So only three DSPs
are used. In the second strategy, each DSP is given row i to j of each matrix. So there can
be as many DSPs as there are rows. Each DSP is then interested in a subarray of every
column repetition in E, no matter what matrix it belongs to. Figure 23 illustrates the two
strategies with three DSPs each.
Figure 23: The figure shows the two strategies that were considered for dividing the processing step’s workload
among the job DSPs. Both strategies are shown with three DSPs for easy comparison, although the second
strategy can use more.
Benefits of the first strategy are:
Efficient use of CM instructions and the memory in LDM.
A DSP can fit its entire matrix into LDM. A CM instruction can then read as
large part of the matrix’s repetition as LDM allows. The number of required CM
instructions is thus small.
No synchronization required.
A DSP processes its own matrix irrespective of what the others do. It reads the
repetition of the matrix from E and when it is done it writes the matrix back to
CM. The work does not need to be synchronized.
Simpler processing algorithm.
A simpler algorithm does not necessarily mean a more efficient implementation.
But the algorithm’s simplicity must be taken into consideration because writing
an implementation for Rate dematching is already very error-prone. Bytes cannot
be addressed in CM (see Paragraph 7.2.3), and the columns of V and W have
been interlaced in E in a special way (see 7.1.3). These are only two concerns
that make an implementation susceptible to errors. For this strategy, the required
CM instructions that transfer data between CM and LDM have been described.
It is relatively easy to calculate what parts of E are of interest for a DSP (namely
those where the matrix’s repetitions are located), and the DSP works with
complete columns.
Better improvements by optimizing critical loops.
If an implementation is to perform the soft combination column by column, then
this strategy is preferable for it works with complete columns. A critical loop
that soft combines a column will execute for a longer duration of time per
column it processes. This increases the amount of time that the implementation
spends in the loop. Optimizing the loop gives therefore greater reduction in the
implementation’s execution time.
The disadvantages of the first strategy are:
The number of job DSPs is restricted.
One DSP can first process one matrix and then another, but there cannot be
more than three of them sharing the workload.
Bad workload balancing due to the interlaced bytes of V and W.
A DSP that is to process U reads a repetition of it from E. It is interested in
every byte that it reads. But a DSP that reads a repetition of V or W from E is
only interested in roughly every second byte, due to how the matrices’ bytes
have been interlaced. The CM instructions of the latter case take more time to
execute, but the same problem occurs when considering LDM instructions.
When the DSP has placed a repetition of the matrix into its LDM and begins to
read sequences of bytes from it into its temporary registers, it is again only
interested in every second byte. The critical loop will therefore require more
time to execute.
Bad workload balancing due to where E ends.
E may end abruptly anywhere in the repetition of a matrix. If there is a complete
repetition of U but only one column repetition of V and W in E, then two of the
DSPs will finish executing much faster than the third.
Benefits of the second strategy are:
Only practical restrictions on the number of DSPs.
There can theoretically be as many of them sharing the workload as each matrix
has rows.
No synchronization required.
No synchronization is required for this strategy either.
Good workload balancing.
The previous strategy had workload balancing issues due to how the column
repetitions of V and W have been interlaced in E. But in this strategy a DSP is
given the rows i to j of U, V and W respectively. So each DSP must participate
in processing V and W, which is more laborious than processing U. Also, E can end anywhere
in a matrix's repetition but the workload will still be nearly equal for the DSPs.
This is because for each column repetition in E, every DSP is interested in a part
of it. But in the previous strategy only one DSP is interested in that column
repetition.
No unnecessary bytes read from CM.
The repetitions of V and W in E have been interlaced. Suppose the subarray
E(k…l) is retrieved from CM to be soft combined with the rows i to j of
column c in V. Only every second byte of the subarray is of interest for this task.
But Paragraph 7.1.3 shows how nearly all the other bytes are to be soft
combined with column d of W, and in fact most of them belong to rows i to j
too. The method needs to be carefully elaborated further if it is going to be used.
But in this strategy a DSP is to process the rows i to j in both of the columns c
and d. For one repetition of V and W in E, it can obtain the bytes that it requires
for this task by using a single CM instruction.
The disadvantages of the second strategy are as follows:
More CM instructions, and inefficient use of them.
A DSP must still process every column of the matrices, even if only rows i to j
are covered in each column. Consider a complete repetition E(k…l) of a matrix.
Since a DSP processes a part of each column, it needs 32 separate subarrays of
the repetition. 32 CM instructions must therefore be executed to retrieve them to
LDM. The subarrays become shorter the more DSPs share the matrices' rows.
This makes the read latency of each CM instruction more significant compared
to the time spent on reading actual data.
More complex processing algorithm.
For each repetition of a matrix in CM, it must be calculated where the 32
subarrays that a DSP requires are located. Only performing these calculations is
error-prone. But retrieving each of them to LDM makes it worse due to the
reasons stated in Paragraph 7.2.3. It was previously suggested that a single CM
instruction can obtain bytes that are to be soft combined with a column of V and
W respectively. However this too requires careful calculations.
More inefficient implementation and less beneficial optimizations.
As mentioned, this strategy requires more calculations to find the subarrays of E
that are of interest to a DSP. Also, a DSP covers only a part of every column.
Suppose that in the algorithm a critical loop is to soft combine one column each
time it is executed. If at least two DSPs are used then the execution time of this
critical loop is less than a half of the critical loop of the previous strategy. This
makes it less beneficial to optimize it. One can argue that the loop can therefore
process more than one column each time it is executed, but the same argument
can be made for the other strategy.
Perhaps the greatest benefit of the second strategy is its good workload balancing. But it is
believed that if it uses three DSPs, as the first strategy does, then the latter processes the same
input quicker. Estimates were made of the amount of time required to execute CM
instructions for each strategy. Longer execution time is not an acceptable price for better
workload balancing.
On the other hand, the second strategy can divide the work among multiple job DSPs
thus reducing execution time. But it will not be reduced significantly. The matrices’
columns are only 193 rows long. If three DSPs are used then each of them is to process
roughly 60 rows. It is true that the number reduces the more DSPs are used, but there is a
constant amount of work that must be performed by each DSP for each column repetition
in E. This becomes more notable as more DSPs are used.
Yet the most convincing reason speaking in favor of the first strategy is related to a
future version of EMPA. For reasons that are omitted here, a new processing step will be
required immediately after Rate dematching is done with A, B and C (see [10] Paragraph
3.3.4). The processing step permutes the three byte arrays individually. Specifically, it
places the bytes A(0 … D−5) in a matrix row by row, and the output is produced by
reading it column by column. The last bytes A(D−4 … D−1) that were omitted are then
appended to the output.
The matrix's dimensions in the processing step depend on D; the details are omitted.
It is notable though that its number of columns greatly exceeds 32 for all but very small
values of D. For example, if D = 6148 then it will have 384 columns and 16 rows.
By using the first strategy, once a job DSP has performed Rate dematching over its byte
array that is in its LDM, it can immediately proceed to perform this processing step. No
new JOB dispatch is required.
Using the second strategy, each job DSP must first write its rows of the matrices to CM
and then they all must synchronize at a barrier before they can proceed with this new
processing step. Barrier synchronization is very costly due to the reasons stated in
Paragraph 3.4 because it can only be implemented by performing a new JOB dispatch.
In light of these important differences, the first strategy was chosen as a base for a
processing algorithm.
7.3 Description of the processing algorithm
The algorithm uses three job DSPs. Each one is designated to process one of the byte
arrays A, B or C, and it must clear the array if necessary and then soft combine it with E.
All input parameters, including the four arrays, are in CM when the JOB begins.
Suppose a DSP is to process A. When it begins executing the JOB, it first allocates
memory in its LDM for the matrix U. It then reads A from CM and places it into U as
described in Paragraph 7.2. It proceeds by reading repetitions of U in E from CM into
LDM and performing the soft combination. The soft combination is done column by
column as described in Paragraph 7.1.4. Paragraph 7.2.1 shows how a more efficient
algorithm can be written if the matrix is to be cleared. However this was not considered
in this algorithm, but it was utilized later when the implementation was optimized.
8 Conclusion
This chapter concludes the report by summarizing the thesis' results.
8.1 Implementation of Channel deinterleaver
A parallelized implementation for the processing step Channel deinterleaver that can
run on EMPA has been written during the thesis work (Chapter 2 describes EMPA and
Chapter 6 describes the processing algorithm). Its correctness has been verified to the
extent stated in Paragraph 8.3. Its execution time is linear in the input size. Its
memory usage with respect to the Common Memory is linear in the input size
(Paragraph 2.2 describes the Common Memory of EMPA). Specifically, the
implementation requires a sequential portion of the memory that is as large as the
input. The complexity of its execution time and memory usage is irrespective of the
number of job DSPs that perform the processing step (Paragraph 3.2 describes EMPA’s
job DSPs and their role in executing processing steps).
There can be as many job DSPs executing the implementation as the matrix in the
implementation's input has rows (Paragraph 5.2.2 specifies Channel deinterleaver and
explains which matrix is meant, while Paragraph 6.2.1 explains how the work is divided among the
job DSPs). However, Appendix 2.3.3 shows that using more than eight job DSPs is
infeasible for even large input.
The implementation was written in C code. Its execution time was improved for all
possible inputs by optimizing C code (Appendix 2.3.2 describes the optimizations
performed in C code). The execution time was further improved for certain types of input
by optimizing assembler code (Appendix 2.3.4 explains what types, and describes the
optimizations that were performed in assembler code). To give an idea of to what extent
the implementation was optimized, the following table compares the performance of the
implementation’s first version to the final version. This spans over both the optimizations
performed in C code as well as assembler code. The two versions’ execution times have
been measured for some test cases that are specified by Appendix 2.2.2. The execution
times tell how long it would take to execute each test case on EMPA; however, their exact
values cannot be revealed and they have therefore been normalized to the smallest one in
the table.
Table 5: The table compares the performance of the first version of the implementation to the final version.
The execution times of some test cases are shown for respective version. However, all the execution times
have been normalized to the smallest of them.
Test case      Execution time,   Execution time,   Final execution time
               first version     final version     compared to first
perf_600_2     13.80             1                 7.25%
perf_600_6     20.85             2.33              11.15%
perf_600_6_q   29.79             2.07              6.95%
None of the implementation’s critical loops are optimal (Chapter 4 specifies what is
required by a loop that is executed by a DSP to be optimal). Appendix 2.4 suggests how
the implementation can be further improved, which would bring most of its critical loops
to optimality and the rest of them closer to optimality.
8.2 Implementation of Rate dematching
A parallelized implementation for the processing step Rate dematching that can run on
EMPA has been written throughout the thesis (Chapter 2 describes EMPA, while Chapter
7 describes the processing algorithm). Its correctness has been verified to the extent
stated in Paragraph 8.3. Its execution time is linear compared to the input size. Its
memory usage with respect to the Common Memory is constant compared to the input
size, if one disregards from the amount of memory required to store the input (Paragraph
2.2 describes the Common Memory of EMPA). Specifically the implementation requires
a small portion of the memory and its size is irrespective of the input’s size. The
complexity of its execution time and memory usage is irrespective of the number of job
DSPs that perform the processing step (Paragraph 3.2 describes EMPA’s job DSPs and
their role in executing processing steps).
No more than three job DSPs can be used to execute the implementation (Paragraph
7.2.4 explains how the work is divided among the job DSPs).
The implementation was written in C code. Its execution time was improved for all
possible inputs by optimizing C code and then assembler code (paragraphs 3.3.1 and
3.3.2 in Appendix 3 describe the optimizations). To give an idea of the extent to which
the implementation was optimized, the following table compares the performance of the
implementation's first version to the final version. The two versions' execution times
have been measured for some test cases that are specified in Appendix 3.2.2. The
execution times tell how long it would take to execute each test case on EMPA; however,
their exact values cannot be revealed and they have therefore been normalized to the
smallest one in the table.
Table 6: The table compares the performance of the first version of the implementation to the final version.
The execution times of some test cases are shown for the respective versions. All the execution times have
been normalized to the smallest of them.

Test case        | Execution time, first version | Execution time, final version | Final compared to first
matrix_cleared_U | 44.06                         | 1.80                          | 4.08%
matrix_cleared_V | 44.77                         | 2.21                          | 4.95%
matrix_cleared_W | 44.16                         | 2.23                          | 5.06%
matrix_U         | 42.26                         | 2.50                          | 5.92%
matrix_V         | 42.98                         | 2.96                          | 6.89%
matrix_W         | 42.37                         | 2.99                          | 7.06%
overhead_W       | 1.85                          | 1                             | 54.01%
None of the implementation's critical loops are optimal (Chapter 4 specifies what is
required of a loop executed by a DSP for it to be optimal). Appendix 3.4 suggests how
the implementation can be further improved, which would bring all of its critical loops
closer to optimality, but not fully.
8.3 The implementations' correctness

ELTE provides test cases that can be used to verify that an implementation of a
processing step complies with the step's specification. Why these tests are considered to
be correct and how they are produced is omitted, but it is noteworthy that the division of
ELTE where they are produced is different from the division where the implementations
are written.
Such tests have been used to verify that the implementation of Channel deinterleaver
works correctly. Additional tests were produced in the thesis as specified in Appendix
2.2.1. The reason is that the tests provided by ELTE were few in number and, worse,
they test the implementation only for large input. The author deemed that even though
ELTE's tests can be used to verify that the processing step's specification has not been
misunderstood while writing the processing algorithm, they could not be relied on to
verify whether any mistakes had been made while writing the implementation.15
Nevertheless, the implementation works correctly with respect to ELTE's tests.
However, the tests that ELTE provides for verifying an implementation of Rate
dematching could not be used. The reason is that every test's output is produced from
the input by applying Rate dematching and then a subsequent processing step. There was
not enough time to understand that step, let alone to write even a simple
implementation of it. The author could therefore only rely on the tests that he produced
as specified in Appendix 3.2.1. They are many in number and were produced by a
separate implementation of the processing step. It is not a parallelized implementation,
and it was developed by following the step's specification as closely as possible. Only on
a few occasions was its performance cautiously improved, since the amount of time it
would otherwise require to produce the tests was unacceptable.
There is still no guarantee that the author did not misunderstand the processing
step's specification when writing this simple implementation, let alone when writing
the parallelized implementation that is part of the thesis result. Thus it is up to ELTE to
further verify the correctness of the parallelized implementation and make corrections if
it does not follow the specification the way they believe to be correct.
15 Such mistakes are more informally known as “bugs”.
Appendices

Here follow the appendices of the report. Appendix 1 briefly explains what tools were
used to write implementations during the course of the thesis. Appendix 2 and Appendix
3 are devoted to specifying in detail each processing algorithm that was implemented,
and also discuss how the implementations were tested and optimized.
As Appendix 1 will explain, a certain compiler was used to produce implementations for
EMPA. Appendix 4 reviews the assembler code produced by the compiler and how
efficiently it uses EMPA's capabilities.
Appendix 5 suggests hardware changes to EMPA that can make it easier to write
efficient implementations.
To avoid confusion when referring to paragraphs in the appendices, if the paragraph is
in the appendices then its number is simply stated. However if the paragraph is in one of
the report’s main chapters, then the chapter number is stated first. For instance
“Paragraph 2.2” refers to a paragraph in Appendix 2, while “Chapter 2, Paragraph 2.2”
naturally refers to a paragraph in one of the report’s main chapters.
Appendix 1. Tools used to write implementations
ELTE has developed a compiler that compiles C code to execute on EMPA. This
compiler will from now on be referred to as the EMPA Compiler (EMC).
A copy of EMPA is not available for executing implementations and seeing if they work
correctly. Instead a simulator is used that runs on a host computer with UNIX. This
simulator will from now on be referred to as the EMPA Simulator (EMS). All the
capabilities of EMPA are simulated, including concurrent execution of multiple DSPs.
The number of DSPs can be chosen freely; the host computer is not required to have
multiple cores for this. The execution time between any two points of the implementation
can be precisely measured as if it were running on EMPA.
Data from files on the host computer can be written to CM in the simulator. This can
then be used as input to an implementation. To verify the correctness of the
implementation, the results that it produces in CM can be compared to contents of files
on the host computer. EMS constitutes the backbone of the debugger that is used to find
errors in implementations.
Therefore EMS was greatly used in the thesis to verify correctness and measure
performance of implementations that were written.
Appendix 2. Channel deinterleaver

This appendix is devoted to specifying the processing algorithm that was used as a base
for the implementation of Channel deinterleaver, and to explaining how the
implementation was developed.
It is Paragraph 2.1 that specifies the processing algorithm. Paragraph 2.2 describes the
test cases that were used for verifying correctness and performance while writing the
implementation.
Paragraph 2.3 explains what steps were taken to optimize the implementation. The
improvements are described step by step, and how they changed the performance is
measured with the test cases. The implementation's final performance is also presented.
The appendix is concluded by Paragraph 2.4 that explains how the implementation
should be improved in the future.
2.1 Specification of the processing algorithm

The processing algorithm will now be presented by specifying the functions that it uses
(refer to Chapter 6, Paragraph 6.2 to understand this paragraph). This is done in the
following subparagraphs. It is noteworthy that these functions are also used in the first
version of the implementation. The functions are deint_master, deint_slave,
deint_section, deint_segment and deint_column. The latter three process the section.
deint_master is the master function and deint_slave is the entry function (see Chapter 3,
Paragraph 3.2 for how a JOB is dispatched). None of the functions have an explicit return
value. Since this is the first processing algorithm to be presented, the dispatch procedure
performed by the master and entry functions will be thoroughly described.
2.1.1 Specification of deint_master
This is the master function. The function does not calculate how the matrix is divided
into sections. Instead each job DSP calculates where its own section is. This is done to
dispatch the JOB only once and to reduce the amount of work performed by the dispatch
DSP executing the master function. Once the dispatch has been performed, the function
awaits a message that notifies it that the JOB has been completed.
The function’s input is as follows:
CM_input, a pointer to the beginning of the matrix in CM.
CM_output, a pointer to the beginning of where the matrix is to be written in CM.
nr_cols, the number of columns the matrix has.
nr_rows, the number of rows the matrix has.
symb_size, the number of bytes per symbol.
nr_q_symbs, the number of q symbols in the matrix.
nr_DSPs, the number of job DSPs that the JOB is to be dispatched to.
The function works as follows:
Algorithm 9: The algorithm specifies how deint_master works.
deint_master (
CM_input, CM_output, nr_cols, nr_rows, symb_size, nr_q_symbs, nr_DSPs
) {
write CM_input, CM_output, nr_cols, symb_size, nr_rows,
nr_q_symbs, nr_DSPs to CM at location CM_deint_input;
dispatch the JOB deint_slave(CM_deint_input) to #nr_DSPs job DSPs;
await a message with signal job_done;
return;
}
2.1.2 Specification of deint_slave
This is the entry function that is the first function to be initiated on every job DSP that
participates in the JOB.
What section of the matrix the DSP is to work with is calculated by (i) reading the input
parameters stored in CM by the master function and (ii) reading the unique worker id that
has been assigned to the DSP and is located in the received message’s header.
The function uses two small functions, namely (i) calculate_section_position to
calculate the section’s position and size and (ii) calculate_segment_size to calculate the
maximum size a segment can have. These are specified by Algorithm 11.
At the end of the entry function the DSP checks whether it was the last to finish its
section. If so, a message is sent back to the dispatch DSP running the master function,
notifying it that the JOB has been completed. The job DSPs coordinate this by using a
lock that has been initiated by the master function. The details have been omitted.
The function’s input is as follows:
CM_deint_input, a pointer to CM where the master function stored the input
parameters.
The function works as follows:
Algorithm 10: The algorithm specifies how deint_slave works.
deint_slave (CM_deint_input) {
Read the variables CM_input, CM_output, nr_cols, symb_size, nr_rows,
nr_q_symbs and nr_dsps from CM at location CM_deint_input;
//See the input of deint_master for an explanation of each.
read the variable worker_id (0 <= worker_id < nr_dsps)
from the dispatch message;
//Calculate on what row the DSP’s section begins and
//how many rows it spans over.
(sec_start_row, sec_nr_rows) =
calculate_section_position (nr_rows, worker_id, nr_dsps);
//Calculate the maximum number of rows a segment can span over.
seg_nr_rows = calculate_segment_size (nr_cols, symb_size, ldm_size);
//Reserve memory in LDM for the segment.
reserve in LDM seg_nr_rows*nr_cols*symb_size bytes and
let LDM_seg be a pointer to the location;
//Reserve memory in LDM for the column buffer.
reserve in LDM seg_nr_rows*symb_size bytes and
let LDM_col_buff be a pointer to the location;
//Process the section.
deint_section (
CM_input, CM_output, nr_cols, symb_size, nr_rows, nr_q_symbs,
sec_start_row, sec_nr_rows, seg_nr_rows, LDM_seg, LDM_col_buff
);
if (this DSP is the last one to finish) {
send a message with signal job_done to
the dispatch DSP that dispatched this JOB;
}
return;
}
The following algorithm specifies the two functions that were used by deint_slave:
Algorithm 11: The two functions that are used by deint_slave are specified.
//The function calculates where the section of a job DSP begins and
//over how many rows it spans.
calculate_section_position (nr_rows, worker_id, nr_dsps) {
divide the rows of the matrix in #nr_dsps sections,
where the largest section is no more than
one row larger than the smallest;
choose section #worker_id by assigning
the variables sec_start_row and sec_nr_rows;
return sec_start_row, sec_nr_rows;
}
//The function calculates the maximum number of rows
//that a segment can span over.
calculate_segment_size (nr_cols, symb_size, ldm_size) {
seg_nr_rows = round_down (ldm_size/(symb_size*nr_cols + symb_size));
//This is a solution to the equation of Formula 9.
return seg_nr_rows;
}
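As a concrete illustration, the two helpers can be sketched in ordinary C. The signatures and the exact row-partitioning rule (the first nr_rows mod nr_dsps sections receive one extra row) are assumptions made for this sketch; the report does not show the real EMPA code.

```c
#include <assert.h>

/* Hypothetical C sketch of Algorithm 11 (names and types are assumptions).
 * The rows are split so that no section is more than one row larger than
 * any other: the first (nr_rows % nr_dsps) sections get one extra row. */
static void calculate_section_position(int nr_rows, int worker_id, int nr_dsps,
                                       int *sec_start_row, int *sec_nr_rows)
{
    int base  = nr_rows / nr_dsps;  /* rows that every section gets   */
    int extra = nr_rows % nr_dsps;  /* sections that get one row more */

    *sec_nr_rows   = base + (worker_id < extra ? 1 : 0);
    *sec_start_row = worker_id * base + (worker_id < extra ? worker_id : extra);
}

/* Largest seg_nr_rows such that the segment plus the column buffer fit in
 * LDM: seg_nr_rows*(nr_cols*symb_size + symb_size) <= ldm_size (Formula 9).
 * Integer division performs the round_down. */
static int calculate_segment_size(int nr_cols, int symb_size, int ldm_size)
{
    return ldm_size / (symb_size * nr_cols + symb_size);
}
```

For example, 13 rows divided over 4 job DSPs would yield sections of 4, 3, 3 and 3 rows starting at rows 0, 4, 7 and 10.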
2.1.3 Specification of deint_section
This function processes a section. It divides the section into multiple segments and
invokes the function deint_segment on each of them. When a segment has been
processed deint_section writes it to the output in CM. All segments will contain as many
rows as specified by the input, except the last one. It will be as small as required to cover
the end of the section.
Input parameters that have not previously been described are:
sec_start_row, the matrix’s row where the section starts.
sec_nr_rows, how many rows the section spans over.
seg_nr_rows, the maximum size of a segment.
LDM_seg, a pointer to the segment in LDM.
LDM_col_buff, a pointer to the column buffer in LDM.
The function works as follows:
Algorithm 12: The algorithm shows how deint_section works.
deint_section (
CM_input, CM_output, nr_cols, symb_size, nr_rows, nr_q_symbs,
sec_start_row, sec_nr_rows, seg_nr_rows, LDM_seg, LDM_col_buff
) {
divide the matrix from row #sec_start_row to
row (sec_start_row+sec_nr_rows-1) into multiple segments,
each of them with #seg_nr_rows rows except the last that may be smaller;
for (each segment) {
//The current segment starts at row #seg_start_row
//and spans over #curr_seg_nr_rows rows.
deint_segment (
CM_input, LDM_seg, LDM_col_buff, nr_cols, symb_size,
nr_rows, nr_q_symbs, seg_start_row, curr_seg_nr_rows
);
//deint_segment has processed the current segment
//and the result has been stored in LDM at LDM_seg.
write the segment located at LDM_seg to the correct position
of the output at CM_output using one CM instruction;
}
return;
}
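The segment walk of Algorithm 12 amounts to a simple chunked loop. A minimal C sketch follows; the names are assumptions, and the calls to deint_segment and the CM write are reduced to comments:

```c
#include <assert.h>

/* Sketch of the segment loop of deint_section: the section is walked in
 * chunks of seg_nr_rows rows, and only the last chunk may be smaller.
 * The segment starts and sizes are recorded so the walk can be checked. */
static int walk_segments(int sec_start_row, int sec_nr_rows, int seg_nr_rows,
                         int *starts, int *sizes)
{
    int end = sec_start_row + sec_nr_rows;
    int n = 0;
    for (int row = sec_start_row; row < end; row += seg_nr_rows) {
        int rows_left = end - row;
        starts[n] = row;
        sizes[n]  = rows_left < seg_nr_rows ? rows_left : seg_nr_rows;
        /* here: deint_segment(...) on this chunk, then one CM instruction
         * writing the finished segment to its place at CM_output */
        n++;
    }
    return n;
}
```

A section of 25 rows starting at row 100 with seg_nr_rows = 10 would, for instance, be walked as segments of 10, 10 and 5 rows.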
2.1.4 Specification of deint_segment
This function processes one segment. For each column of the segment, it invokes
deint_column. In a last step, all the segment’s q symbols are cleared.
Input parameters that have not previously been described are:
seg_start_row, the matrix's row where the segment begins.
curr_seg_nr_rows, how many rows the segment spans over.
The function works as follows:
Algorithm 13: The algorithm specifies how deint_segment works.
deint_segment (
CM_input, LDM_seg, LDM_col_buff, nr_cols, symb_size,
nr_rows, nr_q_symbs, seg_start_row, curr_seg_nr_rows
) {
for (col_id = 0; col_id < nr_cols; col_id++) {
deint_column (
CM_input, LDM_col_buff, LDM_seg, nr_cols, col_id,
symb_size, nr_rows, seg_start_row, curr_seg_nr_rows
);
}
for (each row in the segment containing q symbols) {
for (each q symbol in the row) {
for (each word of the q symbol) {
set the word to 0;
}
}
}
return;
}
2.1.5 Specification of deint_column
This function reads one of the segment’s columns from CM to the column buffer in
LDM. The column’s symbols are then placed into the segment.
Input parameters that have not previously been described are:
col_id, the column of the segment that is to be processed.
The function works as follows:
Algorithm 14: The algorithm specifies how deint_column works.
deint_column (
CM_input, LDM_col_buff, LDM_seg, nr_cols, col_id,
symb_size, nr_rows, seg_start_row, curr_seg_nr_rows
) {
//The input is CM_input, LDM_col_buff, LDM_seg, nr_cols, col_id,
//symb_size, nr_rows, seg_start_row and curr_seg_nr_rows.
read column #col_id from the input in CM (located at CM_input)
to the column buffer in LDM (located at LDM_col_buff);
for (each symbol in the column buffer) {
for (each word of the symbol) {
place the word at its position in the segment at LDM_seg;
}
}
return;
}
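To make the data movement concrete, here is a plain-C stand-in for Algorithm 14. It assumes the input in CM stores the matrix column by column (so one column is a single contiguous CM read) and that the segment in LDM is built row by row; both layouts are assumptions for the sketch, and memcpy stands in for the word-by-word placement.

```c
#include <assert.h>
#include <string.h>

/* Host-side sketch of deint_column: copy row r of column col_id from the
 * (assumed column-major) input to position [r][col_id] of the row-major
 * segment. symb_size is in bytes. */
static void deint_column_sketch(const unsigned char *cm_input,
                                unsigned char *ldm_seg,
                                int nr_cols, int col_id, int symb_size,
                                int nr_rows, int seg_start_row,
                                int curr_seg_nr_rows)
{
    for (int r = 0; r < curr_seg_nr_rows; r++) {
        const unsigned char *src =
            cm_input + ((size_t)col_id * nr_rows + seg_start_row + r) * symb_size;
        unsigned char *dst =
            ldm_seg + ((size_t)r * nr_cols + col_id) * symb_size;
        memcpy(dst, src, (size_t)symb_size);
    }
}
```

Running this once per column reassembles the segment's rows, which is exactly the deinterleaving step.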
2.2 Test cases used for writing the implementation

Throughout this report, a test case is a single execution of an implementation for a
certain input that the test specifies. The purpose of the test may be to verify the
implementation's correctness or to measure its performance. If it is the former then the
test also specifies the output that is believed to be correct.
Three kinds of test cases were used while writing the implementation of the processing
algorithm that has been suggested. These are tests that (i) verify an implementation’s
correctness, (ii) measure an implementation’s performance and (iii) measure how
intermediate optimizations change an implementation’s performance. These are described
in the following subparagraphs.
The execution times of the test cases that were used for measuring the implementation's
performance cannot be revealed. Throughout this appendix some tables discuss the
implementation's performance by using these tests. In these tables the tests' execution
times have been normalized, which means that every execution time in a table has been
divided by the shortest execution time in the same table.
2.2.1 Test cases for testing correctness
The following table shows the test cases that were used to verify the correctness of an
implementation. Each row is a “test series” that includes multiple test cases. The input
matrix is the same for all of the test cases of a series. The correctness of the
implementation is tested in two ways in each series.
In the first way a dispatch is performed by using the function deint_master (see
Paragraph 2.1 for a specification of the implementation's functions). A varying number
of job DSPs is used, and there is a separate test case for each number of job DSPs.
In the second way only the function deint_section is tested. No dispatch is performed
and the function is invoked on a DSP where the work is performed. The maximum
number of rows that a segment may span over is varied. A separate test case is used for
each segment size.
The main reason for using this approach was to be able to control the segment size in
some of the test cases. It was deemed that some errors in an implementation may surface
only for certain segment sizes. If a dispatch is performed then the entry function
deint_slave calculates what segment size deint_section should use, and this calculation
is solely based on the available memory in LDM.
Table 7: The test cases that were used to verify an implementation's correctness.

test series | number of rows | number of columns | symbol size (bytes) | number of q symbols | job DSPs used when performing a dispatch | segment sizes used when only testing deint_section
1  | 1    | 12 | 2 | 4    | 1, 2           | 1, 2
2  | 1    | 12 | 4 | 4    | 1, 2           | 1, 2
3  | 1    | 10 | 2 | 1    | 1, 2           | 1, 2
4  | 1    | 10 | 4 | 0    | 1, 2           | 1, 2
5  | 1    | 12 | 6 | 2    | 1              | 1
6  | 1    | 10 | 6 | 3    | 1              | 1
7  | 5    | 12 | 4 | 1    | 1, 2, 3, … 5   | 1, 2, 3, … 5
8  | 7    | 10 | 2 | 28   | 1, 2, 3, … 7   | 1, 2, 3, … 7
9  | 10   | 12 | 6 | 0    | 1, 2, 3, … 10  | 1, 2, 3, … 10
10 | 13   | 10 | 4 | 29   | 1, 2, 3, … 13  | 1, 2, 3, … 13
11 | 400  | 12 | 4 | 3999 | 1, 3, 5        | 47, 87, 121
12 | 400  | 10 | 2 | 1197 | 4, 7, 11       | 49, 99, 143
13 | 800  | 12 | 6 | 798  | 13, 16, 19     | 81, 111, 153
14 | 800  | 10 | 4 | 400  | 7, 17, 20      | 59, 96, 173
15 | 1200 | 12 | 6 | 4800 | 9, 15, 21      | 65, 103, 153
Each test series specifies an output that the implementation is expected to produce,
given the input. The output of each test series has been generated using a separate
implementation that was written in the thesis. It runs on an ordinary computer and
follows the processing step’s specification step by step. Performance was not taken into
consideration while writing it in order to make it easier to verify its correctness.
Test series 1 to 10 use small matrices. Due to this it was possible to perform the tests
using all possible segment sizes and number of job DSPs. In fact, the segment sizes and
requested number of DSPs in test series 1 to 4 are greater than the number of rows. It is
required that the implementation works despite this.
Test series 11 to 15 are large, and only specific segment sizes and numbers of DSPs
could be used; otherwise the tests would take too long to execute.
Paragraph 2.3.4 specifies optimizations that were performed by rewriting assembler
code. The optimizations were aimed at matrices with 12 columns and symbol sizes of 2
and 6 bytes. At that stage it was deemed that the test cases of Table 7 were not enough to
further verify the correctness of the implementation. Because of this, twelve new test
cases were introduced. Six of them use small matrices, but they will not be further
specified.
2.2.2 Test cases for measuring performance
The following table presents test cases that were used to measure the performance of an
implementation. Again each row is a “test series”, including multiple test cases. In a
series a varying number of job DSPs are used, and there is a separate test case for each
number of job DSP. The time each DSP requires to execute the function deint_section
for its section is measured.
Table 8: Test cases that were used for measuring an implementation's performance.

test series | number of rows | number of columns | symbol size (bytes) | number of q symbols to reset | number of job DSPs used
1 | 5    | 12 | 2 | 0   | 1
2 | 200  | 12 | 4 | 100 | 1, 2, 3, … 8
3 | 600  | 12 | 2 | 0   | 1, 2, 3, … 8
4 | 600  | 12 | 4 | 0   | 1, 2, 3, … 8
5 | 600  | 12 | 6 | 0   | 1, 2, 3, … 8
6 | 1200 | 12 | 2 | 0   | 1, 2, 3, … 8
7 | 1200 | 12 | 4 | 0   | 1, 2, 3, … 8
8 | 1200 | 12 | 6 | 0   | 1, 2, 3, … 8
The purpose of test series 1 is to see how the implementation performs for a small
matrix. Test series 2 has been designed to see how the resetting of q symbols disturbs the
workload. Only the last section contains q symbols in each test case. As the number of
DSPs is increased, the q symbols cover a larger share of the rows of the last section.
Ultimately with 8 DSPs, each row of the last section contains q symbols.
Test series 3 to 8 use large matrices and their purpose is to measure the
implementation’s performance for varying symbol sizes.
This is a total of 57 test cases. Using them after each optimization step would require
too much time, even disregarding the time required to analyze the results. Therefore they
were only used on two occasions: first when the first version of the implementation had
been completed (results presented in Paragraph 2.3.1), and then
when all the optimizations that were performed in C code had been completed (results
presented in Paragraph 2.3.3). The test cases presented in the next paragraph were
designed to be used more often to measure performance.
2.2.3 Test cases for measuring performance during optimization
The following table presents test cases that were used to measure how an
implementation’s performance changes due to an optimization. Each row is a test case. A
dispatch takes place in each test, and only one job DSP is used. The time it takes for the
DSP to execute the function deint_section for its section (which is the entire matrix) is
measured.
Table 9: Test cases that were used for measuring performance change due to an optimization.

test case name | number of rows | number of columns | symbol size (bytes) | number of q symbols | number of job DSPs used
perf_600_2   | 600 | 12 | 2 | 0    | 1
perf_600_2_q | 600 | 12 | 2 | 2400 | 1
perf_600_4   | 600 | 12 | 4 | 0    | 1
perf_600_6   | 600 | 12 | 6 | 0    | 1
perf_600_6_q | 600 | 12 | 6 | 2400 | 1
These test cases will be extensively referred to in the paragraphs that describe the
optimizations that were performed. Because of this the tests have been named.
The tests’ sole purpose is to clearly indicate how beneficial an attempted optimization
is, without requiring too much time to execute and to analyze the results. This is why the
test cases are few in number, use only large matrices and only one job DSP each.
perf_600_4 was introduced when an optimization was done in C that favored symbol
size 4 bytes, and so was not used from the beginning. perf_600_2_q was first used when
assembler optimizations were made.
The purpose of perf_600_2_q and perf_600_6_q is to see how the optimizations affect
the resetting of q symbols. The matrices have as many q symbols as they can have.
2.3 Implementation of the processing algorithm

The following subparagraphs describe the implementation of the processing algorithm
that has been specified in Paragraph 2.1. Paragraph 2.3.1 describes the performance of
the first version of the implementation. Paragraph 2.3.2 describes the optimizations that
were performed in C code and presents their respective impact on performance.
Paragraph 2.3.3 analyzes the implementation's performance after all the C optimizations.
Paragraph 2.3.4 describes the optimizations that were made in assembler code. Finally
Paragraph 2.3.5 shows how the execution time improved due to the assembler
optimizations.
2.3.1 Performance of the first version of the implementation
When the first version of the implementation had been completed, its performance was
measured by using the test series presented in Paragraph 2.2.2. The following table
presents the results. Each row in the table represents a test series, and each cell in a row
shows a test case in that series. The longest and shortest execution times of the job DSPs
are shown per test case. Recall that only the time required to execute deint_section is
measured for each DSP.
Table 10: For each test case the longest and the shortest execution times of the job DSPs are shown (as
longest / shortest); however, the values have been normalized. Every execution time in the table has been
divided by the shortest execution time.

test series | 1 DSP  | 2 DSPs          | 3 DSPs        | 4 DSPs        | 5 DSPs        | 6 DSPs        | 7 DSPs        | 8 DSPs
1 | 1      | -               | -             | -             | -             | -             | -             | -
2 | 43.71  | 22.89 / 20.85   | 15.91 / 14.08 | 12.62 / 10.58 | 10.56 / 8.53  | 9.13 / 7.09   | 8.10 / 6.06   | 7.48 / 5.45
3 | 83.21  | 41.62 / 41.62   | 27.85 / 27.85 | 20.97 / 20.97 | 16.83 / 16.83 | 14.08 / 14.08 | 12.15 / 12.02 | 10.64 / 10.64
4 | 103.80 | 52.05 / 52.05   | 34.62 / 34.62 | 26.05 / 26.05 | 20.90 / 20.90 | 17.47 / 17.47 | 15.06 / 14.89 | 13.18 / 13.18
5 | 124.40 | 62.22 / 62.22   | 41.67 / 41.67 | 31.13 / 31.13 | 24.96 / 24.96 | 20.85 / 20.85 | 17.98 / 17.77 | 15.72 / 15.72
6 | 166.11 | 83.21 / 83.21   | 55.40 / 55.40 | 41.62 / 41.62 | 33.36 / 33.36 | 27.85 / 27.85 | 24.00 / 23.86 | 20.97 / 20.97
7 | 207.56 | 103.80 / 103.80 | 69.21 / 69.21 | 52.05 / 52.05 | 41.76 / 41.76 | 34.62 / 34.62 | 29.82 / 29.65 | 26.05 / 26.05
8 | 248.75 | 124.40 / 124.40 | 83.03 / 83.03 | 62.22 / 62.22 | 49.89 / 49.89 | 41.67 / 41.67 | 35.92 / 35.71 | 31.13 / 31.13
The execution times of the job DSPs are equal in all test cases except for two kinds. The
first kind is tests where seven DSPs are used: the matrix's number of rows cannot be
evenly divided among the sections, and so some sections are one row longer than others.
The second kind is test cases belonging to series 2, where in each of them the DSP that
clears q symbols has more work to do than the others.
The only test case of series 1 requires many cycles per symbol. The number would be
worse if there were fewer symbols in the matrix. The test’s entire input and output both
fit in LDM. It is not necessary to read the input column by column from CM (as this
implementation does), and instead the entire matrix can be read with one single CM
instruction. If small matrices will be common, then the execution time for them can be
greatly improved by writing a special implementation for them.
Test series 2 shows that uneven load balancing due to q symbols is a practical problem.
In each of its test cases the longest execution time belongs to the DSP that clears q
symbols. The execution times among the remaining DSPs are nearly equal. When using
eight DSPs, the one that clears q symbols requires 37% more time than the others. This
must be addressed by incorporating one of the ideas suggested in Chapter 6, Paragraph
6.2.2.
In test series 3 to 8 the work is well divided among the job DSPs. Compare the
execution time of any test case where one DSP is used to the test case in the same series
where eight DSPs are used. The latter is only slightly more than one eighth of the former.
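This scaling claim can be checked with a small calculation; the helper below is illustrative only and uses the normalized values from Table 10, test series 3 (83.21 with one job DSP versus 10.64 with eight):

```c
#include <assert.h>

/* Parallel efficiency: measured speedup divided by the DSP count.
 * A value of 1.0 means perfect scaling. For series 3 of Table 10,
 * perfect scaling over 8 DSPs would give 83.21 / 8 = 10.40. */
static double parallel_efficiency(double t_one, double t_n, int n)
{
    return (t_one / t_n) / (double)n;
}
```

For series 3 this gives a speedup of roughly 7.82 out of 8, an efficiency of about 98%, which matches the observation that the eight-DSP time is only slightly more than one eighth of the one-DSP time.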
2.3.2 Major optimizations performed in C code
This paragraph describes how the implementation was optimized in C code. Some of the
test cases presented in Paragraph 2.2.3 were used to measure the performance changes.
The first version of the implementation performed as the table below shows. The
subsequent paragraphs describe the optimizations one by one.
Table 11: The table shows how the first version of the implementation performed for some test cases. The
implementation’s execution times for the tests were measured in cycles, but the values have been
normalized.
Test case    | Execution time
perf_600_2   | 1
perf_600_6   | 1.51
perf_600_6_q | 2.16
2.3.2.1 Using specific symbol sizes in deint_column
The innermost loop of deint_column currently places a symbol from the column buffer
into the segment, but it transfers only one word per iteration. There are only three symbol
sizes, and so the loop was replaced by a conditional statement with three outcomes. See
the following algorithm and compare to Algorithm 14:
Algorithm 15: The algorithm shows how the symbols are placed from the column buffer into the segment
after the optimization.
for (each symbol in the column buffer) {
if (symbol size is 2 bytes) {
place the word at its position in the segment in LDM;
} else if (symbol size is 4 bytes) {
place the first word at its position in the segment in LDM;
place the second word at its position in the segment in LDM;
} else { //symbol size is 3 words
place the first word at its position in the segment in LDM;
place the second word at its position in the segment in LDM;
place the third word at its position in the segment in LDM;
}
}
Depending on the symbol size, the conditional statement places the correct number of
words into the segment. This makes better use of the DSP’s short instruction pipeline, for
the conditional branch instruction that was performed once per iteration of the innermost
loop has been removed. Other instructions related to the loop’s overhead have also been
removed. The innermost loop of deint_segment that clears a q symbol one word per
iteration was also replaced by a conditional statement in the same manner.
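The structure of Algorithm 15 can be sketched in C as below. The sketch assumes 16-bit words (so the symbol sizes 2, 4 and 6 bytes are 1, 2 and 3 words) and a row-major segment; the function and type names are assumptions.

```c
#include <assert.h>

typedef unsigned short word_t; /* assumed 16-bit DSP word */

/* Sketch of Algorithm 15: one branch per symbol selects a fixed number of
 * word copies, replacing the word-by-word inner loop and its per-word
 * branch. col_buff holds the column's symbols; the segment is row-major. */
static void place_symbols(const word_t *col_buff, word_t *seg, int nr_symbols,
                          int nr_cols, int col_id, int words_per_symbol)
{
    for (int r = 0; r < nr_symbols; r++) {
        const word_t *src = col_buff + r * words_per_symbol;
        word_t *dst = seg + (r * nr_cols + col_id) * words_per_symbol;
        if (words_per_symbol == 1) {
            dst[0] = src[0];
        } else if (words_per_symbol == 2) {
            dst[0] = src[0];
            dst[1] = src[1];
        } else { /* 3 words */
            dst[0] = src[0];
            dst[1] = src[1];
            dst[2] = src[2];
        }
    }
}
```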
The following table shows how the implementation’s performance changed.
Table 12: The table shows how the optimization changed the implementation's performance. The test
cases' new execution times are compared to those measured in Table 11. All presented execution times
have been normalized to the smallest one.

Test case    | Execution time before the optimization | Execution time after the optimization | New execution time compared to before
perf_600_2   | 1.70                                   | 1                                     | 59%
perf_600_6   | 2.58                                   | 1.07                                  | 41%
perf_600_6_q | 3.68                                   | 1.84                                  | 50%
2.3.2.2 Unrolling deint_column’s loop
The new loop of deint_column, shown in Algorithm 15, should be unrolled so it
processes more symbols per iteration. This is to increase the amount of work performed
per iteration and thus make the loop’s overhead less significant. A new loop must be
placed after it that processes any remaining symbols. But leaving the conditional
statement shown in Algorithm 15 within the loop would delay the instruction pipeline
due to branch instructions and this is not necessary. Specifically, the loop was replaced
by a conditional statement which chooses among three outcomes depending on symbol
size. In each outcome, an unrolled loop that processes five symbols per iteration is
executed first. Then, a loop that processes the remaining (maximum four) symbols is
executed. Thus there are a total of six loops. The following algorithm clarifies this for
symbol size 4 bytes (2 words), and the same approach is used for the other symbol sizes.
Algorithm 16: The algorithm shows how the symbols are placed from the column buffer into the segment
after the optimization.
if (symbol size is 1 word) {
...
} else if (symbol size is 2 words) {
for (symbols 5*i to 5*i+4 in the column buffer, where i = 0, 1, 2, ...) {
place symbol 5*i into the segment word by word;
place symbol 5*i+1 into the segment word by word;
place symbol 5*i+2 into the segment word by word;
place symbol 5*i+3 into the segment word by word;
place symbol 5*i+4 into the segment word by word;
}
for (each remaining symbol in the column buffer) {
place the symbol into the segment word by word;
}
} else { //symbol size is 3 words
...
}
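Algorithm 16's structure for the 2-word case can be sketched in C as follows. The function and layout parameters are hypothetical; the point is the five-fold unrolled main loop followed by a cleanup loop for the at most four remaining symbols.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: place 2-word symbol i from the column buffer
 * into the segment at its row position. */
static void place_one_2w(uint16_t *segment, const uint16_t *col_buf,
                         size_t i, size_t row_stride, size_t col_off)
{
    uint16_t *dst = segment + i * row_stride + col_off;
    dst[0] = col_buf[2 * i];
    dst[1] = col_buf[2 * i + 1];
}

void place_column_2w(uint16_t *segment, const uint16_t *col_buf,
                     size_t nr_symbols, size_t row_stride, size_t col_off)
{
    size_t i = 0;
    /* Main loop, unrolled: five symbols per iteration. */
    for (; i + 5 <= nr_symbols; i += 5) {
        place_one_2w(segment, col_buf, i,     row_stride, col_off);
        place_one_2w(segment, col_buf, i + 1, row_stride, col_off);
        place_one_2w(segment, col_buf, i + 2, row_stride, col_off);
        place_one_2w(segment, col_buf, i + 3, row_stride, col_off);
        place_one_2w(segment, col_buf, i + 4, row_stride, col_off);
    }
    /* Cleanup loop: at most four remaining symbols. */
    for (; i < nr_symbols; i++)
        place_one_2w(segment, col_buf, i, row_stride, col_off);
}
```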
No changes were made to how q symbols are cleared. Execution time changed as the
following table shows. Test case perf_600_4 was introduced at this stage for the next
optimization.
Table 13: The table shows how the optimization changed the implementation’s performance. The test
cases’ new execution times are being compared to the latest measured in Table 12. All presented execution
times have been normalized to the smallest one. The test perf_600_4 has not previously been used.
Test case     Execution time          Execution time          New execution time
              before optimization     after optimization      compared to before
perf_600_2    4.43                    1                       23%
perf_600_4    First time of use       1.67                    First time of use
perf_600_6    4.73                    2.44                    52%
perf_600_6_q  8.14                    5.83                    72%
2.3.2.3 Reading and writing more words per LDM instruction
Two words can be read or written by a single LDM instruction, and this is usable when
working in C.16
This was used in the four loops that process symbol sizes 4 and 6 bytes.
For the former, each symbol is read and written from the column buffer to the segment
using two instructions. For the latter, the first two words of a symbol are read and written
using two instructions, followed by another two for the last word. Still no changes were
made to how q symbols are cleared.
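Whether the compiler emits double-word LDM accesses depends on the data type used (cf. footnote 16). One way to express the intent in portable C, sketched here under that assumption, is to move a 4-byte symbol as a single 32-bit unit; memcpy keeps the access well-defined regardless of alignment, and compilers typically lower it to one load and one store.

```c
#include <stdint.h>
#include <string.h>

/* Move a 4-byte (2-word) symbol as one 32-bit unit: one read and one
 * write instead of two of each. memcpy avoids alignment pitfalls. */
static inline void copy_symbol_4b(uint16_t *dst, const uint16_t *src)
{
    uint32_t tmp;
    memcpy(&tmp, src, sizeof tmp);   /* one 32-bit read (two words)  */
    memcpy(dst, &tmp, sizeof tmp);   /* one 32-bit write (two words) */
}

/* A 6-byte (3-word) symbol: the first two words as one 32-bit unit,
 * then the last word separately. */
static inline void copy_symbol_6b(uint16_t *dst, const uint16_t *src)
{
    copy_symbol_4b(dst, src);
    dst[2] = src[2];
}
```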
Execution time changed as follows:
Table 14: The table shows how the optimization changed the implementation’s performance. The test
cases’ new execution times are being compared to the latest measured in Table 13. All presented execution
times have been normalized to the smallest one.
Test case     Execution time          Execution time          New execution time
              before optimization     after optimization      compared to before
perf_600_2    1                       1.01                    101%
perf_600_4    1.67                    1.16                    69%
perf_600_6    2.44                    1.83                    75%
perf_600_6_q  5.83                    5.31                    91%
2.3.2.4 Clearing the q symbols
In the next optimization, the q symbols were dealt with. Chapter 6, Paragraph 6.2.2
discusses two strategies for clearing them, and states which one was chosen for the first
version of the implementation. A switch of strategy was made in this optimization.
Specifically, the clearing of q symbols was moved from deint_segment to deint_column.
Algorithm 17 demonstrates how deint_column was changed by using symbol size 4
bytes as an example. A new parameter nr_q_symbols was added to the input of the
function. It specifies how many symbols at the end of the matrix’s column are q symbols.
When the segment’s column is to be read from CM to the column buffer, only those
symbols that are not q symbols are read. For each symbol size there are then four loops
that process the segment column.
The first two place the symbols that are not q symbols from the column buffer into the
segment. They still work as the loops described in the previous paragraph. The following
two loops clear the remaining q symbols of the column. The first is unrolled and clears
three q symbols per iteration, and the next loop clears any remaining q symbols. The first
loop clears only three q symbols per iteration due to the low number of q symbols in a
matrix. If this number is increased then the following loop will have to do more of the
work.
16 It is a question of what data type is used.
Algorithm 17: How deint_column works after the optimization is described. nr_q_symbols is a new
input parameter. Two new loops are introduced for each symbol size. They clear the q symbols. The first is
unrolled by clearing three q symbols per iteration, while the second clears any remaining q symbols.
nr_non_q_symbols = nr_symbols_in_column – nr_q_symbols;
read the column’s first #nr_non_q_symbols from CM
to the column buffer in LDM;
if (symbol size is 1 word) {
...
} else if (symbol size is 2 words) {
//The old loops only place the non q symbols
//from the column buffer into the segment.
for (
symbols 5*i to 5*i+4 in the column buffer,
where i = 0, 1, 2, ..., nr_non_q_symbols/5-1
) {
place symbol 5*i into the segment;
place symbol 5*i+1 into the segment;
place symbol 5*i+2 into the segment;
place symbol 5*i+3 into the segment;
place symbol 5*i+4 into the segment;
}
for (each remaining non q symbol in the column) {
place the symbol in the segment;
}
//The new loops come next. They clear the q symbols
//in the end of the segment’s column.
for (
symbols (nr_non_q_symbols+3*i) to (nr_non_q_symbols+3*i+2)
in the segment’s column, where i = 0, 1, 2, ...
) {
clear symbol nr_non_q_symbols+3*i in the segment’s column;
clear symbol (nr_non_q_symbols+3*i+1) in the segment’s column;
clear symbol (nr_non_q_symbols+3*i+2) in the segment’s column;
}
for (each remaining q symbol in the column) {
clear the symbol in the segment’s column;
}
} else { //symbol size is 3 words
...
}
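A C sketch of the restructured function for the 2-word symbol size may clarify the control flow. All names and layout parameters are hypothetical, and the unrolling of the copy loop is omitted for brevity; what matters is that the non-q symbols are copied first and the trailing q symbols of the column are then cleared, three per iteration plus a cleanup loop.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of the restructured column processing for 2-word
 * symbols: copy the first nr_non_q symbols from the column buffer, then
 * clear the trailing q symbols of the column in the segment. */
void deint_column_2w(uint16_t *segment, const uint16_t *col_buf,
                     size_t nr_symbols, size_t nr_q_symbols,
                     size_t row_stride, size_t col_off)
{
    size_t nr_non_q = nr_symbols - nr_q_symbols;
    size_t i = 0;
    /* Place the non-q symbols (unrolling omitted in this sketch). */
    for (; i < nr_non_q; i++) {
        uint16_t *dst = segment + i * row_stride + col_off;
        dst[0] = col_buf[2 * i];
        dst[1] = col_buf[2 * i + 1];
    }
    /* Clear q symbols, three per iteration (inner loop left for the
     * compiler to unroll). */
    for (; i + 3 <= nr_symbols; i += 3) {
        for (size_t k = 0; k < 3; k++) {
            uint16_t *dst = segment + (i + k) * row_stride + col_off;
            dst[0] = 0;
            dst[1] = 0;
        }
    }
    /* Cleanup loop: remaining q symbols. */
    for (; i < nr_symbols; i++) {
        uint16_t *dst = segment + i * row_stride + col_off;
        dst[0] = 0;
        dst[1] = 0;
    }
}
```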
Execution time changed as the following table shows. All test cases but perf_600_6_q
have worsened slightly, because the function deint_column now deals with q symbols too
but none of these tests have such symbols. perf_600_6_q executes faster than perf_600_6
even though the matrices are equally large in both tests. This is because the former has
four columns that contain only q symbols. These columns are not read from CM; instead,
the corresponding symbols in the segment in LDM are simply cleared.
Table 15: The table shows how the optimization changed the implementation’s performance. The test
cases’ new execution times are being compared to the latest measured in Table 14. All presented execution
times have been normalized to the smallest one.
Test case     Execution time          Execution time          New execution time
              before optimization     after optimization      compared to before
perf_600_2    1                       1.01                    101%
perf_600_4    1.15                    1.17                    101%
perf_600_6    1.82                    1.85                    102%
perf_600_6_q  5.28                    1.75                    33%
2.3.3 Performance of the implementation after C optimizations
When the optimizations done in C had been completed, the implementation’s
performance was measured by using the test series presented in Paragraph 2.2.2. The
following table demonstrates the results. Each row in the table represents a test series,
and each cell in a row shows a test case in that series. The longest and shortest execution
times of the job DSPs are shown per test case.
Table 16: For each test case, the shortest and longest execution times of the job DSPs are shown, however
the values have been normalized. Every execution time in the table has been divided by the shortest
execution time. Each cell shows the two values as longest/shortest.
Test series \ Number of job DSPs   1       2            3            4            5            6            7            8
1                                  1
2                                  10.70   5.52/5.48    3.91/3.77    3.29/3.25    2.84/2.80    2.53/2.39    2.30/2.16    2.17/1.99
3                                  15.91   8.10/8.10    5.69/5.69    4.49/4.49    3.76/3.76    3.28/3.28    2.93/2.76    2.68/2.68
4                                  19.49   10.20/10.20  6.68/6.68    5.24/5.24    4.37/4.37    3.79/3.79    3.36/3.20    3.08/3.08
5                                  30.35   15.21/15.21  10.74/10.74  7.76/7.76    6.42/6.42    5.52/5.52    4.85/4.71    4.41/4.41
6                                  30.95   15.91/15.91  10.50/10.50  8.10/8.10    6.65/6.65    5.69/5.69    4.87/4.80    4.49/4.49
7                                  38.94   19.49/19.49  13.09/13.09  10.20/10.20  8.47/8.46    6.68/6.68    5.73/5.66    5.24/5.24
8                                  60.39   30.35/30.35  20.42/20.42  15.21/15.21  12.53/12.53  10.74/10.74  9.59/9.49    7.76/7.76
Still, the test case of series 1 shows that the implementation requires on average many
cycles per symbol. A separate implementation for small matrices should still be
considered if they are common.
In each of the test cases of series 2, the DSP that requires the shortest execution time
has been assigned the last section and therefore clears all q symbols. A DSP now benefits
from clearing q symbols. The execution times among the other DSPs are nearly equal.
When eight DSPs are used, the shortest execution time is 9% smaller than the longest. No
further load balancing is necessary for it would affect the longest execution time only
slightly.
Compare the execution times of the first and last tests of series 3. The latter is only about
one sixth of the former, even though eight DSPs are used. The same observation can be
made for the following series, though the situation is not as bad for them since the input
is larger in their cases. But one must recall that the presented execution times do not
include the dispatch procedure. The conclusion is that it is now less beneficial to use
more DSPs than it was before the optimizations were applied, and using more than eight
DSPs is probably a poor use of EMPA’s job DSPs.
The following table compares each test case’s longest execution time after the C
optimizations with the longest execution time before any optimizations were applied. The
values of Table 16 are compared to those of Table 10.
Table 17: The table shows how the C optimizations have improved the performance. The new longest
execution time of each test case is compared to the old before the optimizations.
Test series \ Number of job DSPs   1      2      3      4      5      6      7      8
1                                  66%
2                                  16%    16%    16%    17%    18%    18%    19%    19%
3                                  13%    13%    13%    14%    15%    15%    16%    17%
4                                  12%    13%    13%    13%    14%    14%    15%    15%
5                                  16%    16%    17%    16%    17%    17%    18%    18%
6                                  12%    13%    12%    13%    13%    13%    13%    14%
7                                  12%    12%    12%    13%    13%    13%    13%    13%
8                                  16%    16%    16%    16%    16%    17%    18%    16%
Great improvements can be seen for each test case. For each series, the general rule is
that the observed improvement diminishes as the number of DSPs increases. This is
because as their number grows, the time each of them spends executing the function
deint_column decreases, and all optimizations were aimed at this function. The time they
spend executing other parts of the implementation has not improved, because no
optimizations were aimed at those parts. Thus the observed improvement decreases as
more DSPs are used in each test series, with only a few exceptions. The exceptions arise
because sometimes the number of segments suddenly decreases by one when an
additional DSP is used, which reduces the time spent executing parts of the
implementation that have not been optimized.
2.3.4 Optimizations performed in assembler code
After the optimizations performed in C code, the assembler code produced for the
function deint_column by EMC was reviewed. It was inefficient and had many
deficiencies (Appendix 4 discusses the quality of EMC). It was deemed that the
performance can be greatly improved by performing further optimizations in assembler
code.
The amount of time that could be spent on these optimizations was limited. Therefore
the implementation was only partially optimized. Paragraph 2.3.4.1 explains this further.
It also describes initial changes that were made to the implementation in C code.
Paragraph 2.3.4.2 describes the assembler optimizations. Finally Paragraph 2.3.5 shows
how the execution time improved.
2.3.4.1 Initial changes before optimization
The function deint_column was replaced by six functions, one for each combination of
symbol size and number of columns. This was done in C code. To give an example and
establish a naming convention, deint_column_x_y is used when the matrix has #x
columns and a symbol is #y bytes long.
The division of deint_column into new functions by the three symbol sizes was made
to make the assembler code easier to survey when optimizing it. In deint_column, a
choice is made between twelve loops depending on symbol size (see Algorithm 17). The
four loops that are chosen are the only ones executed each time the function is
called in the same JOB. The assembler code produced would contain all twelve loops and
would be difficult to survey. The twelve loops were therefore divided among the new
functions that replaced deint_column. To give a precise example, the four loops shown in
Algorithm 17 were moved to deint_column_12_4 and deint_column_10_4.
Prior to performing optimizations in assembler code, the DSPs’ register set was not
well known to the author, and the amount of time left to work on the implementation was
very limited. It was deemed that the registers might not be numerous enough to optimize
deint_column’s loops to the required extent. In the loops, the number of columns must be
known to find the segment symbols that are to be modified. To avoid occupying a register
with this value in the loops, the function was also divided by the possible numbers of
columns. This makes the value explicitly known in each function that replaces
deint_column.
In hindsight, the author knows that the register set a DSP has at its disposal was
greatly underestimated. See Chapter 2, Paragraph 2.1.2 for an overview of the registers.17
Only two of the six new functions were optimized in assembler, namely
deint_column_12_2 and deint_column_12_6.
Prior to optimizing in assembler, the two unrolled loops in each function were slightly
changed in C. In deint_column_12_2, the first loop that places the column buffer’s
symbols into the segment (referred to as symb_loop_12_2) was changed so it processes
eight symbols per iteration. The second loop (q_symb_loop_12_2) now clears four q
symbols per iteration. In deint_column_12_6 the corresponding unrolled loops
(symb_loop_12_6 and q_symb_loop_12_6) each process four symbols per iteration.
These changes will be justified when the assembler optimizations are described.
When compiled by EMC, the implementation performs as the following table shows.
The results are compared to Table 15. At this stage the test case perf_600_2_q was
introduced. The clearing of q symbols of size 2 bytes is to be optimized in assembler, and
the test case will show how the time required to clear them changes.
17 Specifically, the author did not know about the address registers and their offset registers. It was believed
that only the temporary registers can be used to execute the instructions of the loop, including LDM
instructions.
Table 18: The implementation was changed in C code before performing assembler optimizations. The
table shows how this changed the implementation’s performance. The test cases’ new execution times are
being compared to the latest measured in Table 15. All presented execution times have been normalized to
the smallest one.
Test case      Execution time        Execution time        New execution time
               before the changes    after the changes     compared to before
perf_600_2     1.11                  1                     90%
perf_600_2_q   First time of use     1.04                  First time of use
perf_600_6     2.02                  2.08                  102%
perf_600_6_q   1.92                  1.93                  101%
2.3.4.2 Description of the assembler optimizations
The assembler code produced by EMC was reviewed and the following table shows the
average number of cycles spent per symbol by the unrolled loops. This includes their
overhead. By modifying the assembler code all of them were optimized. The same table
shows how much they were improved.
Table 19: Each loop’s performance before and after the assembler optimizations are presented. The average
number of cycles per symbol is shown for every loop, however the values have been normalized to the
smallest value.
Loop name           Average cycles per symbol   Average cycles per symbol   New performance
                    before assembler            after assembler             compared to before
                    optimizations               optimizations
symb_loop_12_2      2.1                         1                           48%
q_symb_loop_12_2    2.8                         1                           31%
symb_loop_12_6      4.2                         2.2                         46%
q_symb_loop_12_6    3.6                         1.6                         47%
To describe the assembler optimizations, take symb_loop_12_2 as an example. The
number of LDM instructions used to execute the loop was minimized. They were also
performed pairwise in parallel. In each iteration of the loop, four symbols are read from
the column buffer in LDM to temporary registers using two LDM instructions. They are
then written to the segment using four LDM instructions. Apart from the loop’s
overhead, no other instructions are performed. This motivates processing only four
symbols per iteration, not the eight to which the loop had been changed; the purpose of
that change was to make the loop’s overhead less significant compared to the actual
work. At that point, the DSP’s hardware support for executing loops without any
overhead was not known to the author.
The remaining loops q_symb_loop_12_2, symb_loop_12_6 and q_symb_loop_12_6 were
improved in a similar manner. There is no conflict in hardware resource usage when
executing parallel LDM instructions in any of the four loops (Chapter 2, Paragraph
2.1.3.3 explains such conflicts and their consequences). But still none of the loops meets
the criterion for optimal performance (Chapter 4 states the criterion).
2.3.5 Performance of the implementation after assembler optimizations
Execution time of the test cases improved as the following table shows. The
performance is compared to before the assembler optimizations.
Table 20: The table shows how the assembler optimizations changed the performance. The test cases’ new
execution times are being compared to the latest measured in Table 18. All presented execution times have
been normalized to the smallest one.
Test case      Execution time          Execution time          New execution time
               before optimization     after optimization      compared to before
perf_600_2     1.77                    1.06                    60%
perf_600_2_q   1.85                    1                       54%
perf_600_6     3.69                    2.46                    67%
perf_600_6_q   3.42                    2.19                    66%
The rightmost columns of Table 19 and Table 20 reflect each other well. The former
shows that q_symb_loop_12_2 benefited more than symb_loop_12_2 from the assembler
optimizations. Only the latter loop is used for perf_600_2, but both of them are used for
perf_600_2_q. Because of this, the execution time has improved more for perf_600_2_q
than for perf_600_2.
perf_600_6 has improved less than perf_600_2. This is because the former’s matrix is
three times larger in memory size than the latter’s. It must spend at least three times more
cycles to perform CM instructions. It also requires twice as many segments to cover its
matrix.
2.4 Future improvements
The following paragraphs show how the implementation should be further improved.
Paragraph 2.4.1 shows how the implementation should be changed to make it more
convenient for future changes. Paragraph 2.4.2 gives an idea how much the
implementation’s performance would improve by using the DSP’s hardware support for
executing loops. Paragraphs 2.4.3, 2.4.4 and 2.4.5 assume that the hardware support is
used to execute the critical loops, and then discuss how they can meet the criterion for
optimal performance.
2.4.1 Changes of convenience
The division of deint_column by possible number of columns for the reasons stated in
Paragraph 2.3.4.1 is not necessary. A DSP uses address registers to perform LDM
instructions and their value can be modified by using offset registers.
Therefore, only three functions are needed to perform the work of deint_column, one for
each symbol size. The number of columns must be in their inputs.18
These functions will
be referred to as deint_column_2, deint_column_4 and deint_column_6. By small
modifications of deint_column_12_2 and deint_column_12_6 (that were optimized in
assembler) one can obtain deint_column_2 and deint_column_6.
2.4.2 Using hardware support to execute loops
The DSP’s hardware support for executing loops without overhead should be used for
the critical loops. To give a comparison, the following table shows how the unrolled
loops of deint_column_12_2 and deint_column_12_6 that were optimized in assembler can
be further improved. The results are being compared to Table 19.
Table 21: The table shows how each loop’s performance will improve if the DSP’s hardware support for
executing loops is used. The average number of cycles per symbol is shown for every loop, however the
values have been normalized to the smallest one.
Loop name           Average cycles per symbol   Average cycles per symbol   Performance with
                    without hardware support    with hardware support       hardware support
                                                                            compared to without
symb_loop_12_2      2.5                         1.5                         60%
q_symb_loop_12_2    2.5                         1                           40%
symb_loop_12_6      5.5                         3.5                         64%
q_symb_loop_12_6    4                           2                           50%
It can clearly be seen that the two loops that clear q symbols will benefit more. This is
because their iterations are shorter than those of the two loops that must read from the
column buffer and write to the segment, so their overhead is more significant.
Nevertheless, this should greatly improve the execution times of all the four test cases
used during assembler optimizations.
2.4.3 Making the critical loop of deint_column_4 optimal
Consider the function deint_column_4 that will be used when the symbol size is four
bytes. If its critical loop is to process one segment column at a time as shown by
Algorithm 17, then there can be conflicts in hardware resource usage when executing
parallel LDM instructions (Chapter 2, Paragraph 2.1.3.3 describes such conflicts and
explains the consequences). The loop would not meet the criterion for optimal
performance (the criterion is stated in Chapter 4).
18 The division of deint_column by symbol sizes makes the assembler code easier to survey in the
author’s opinion, but this is just a matter of taste. It will also make further discussions easier to understand.
One solution to this problem is that deint_column_4’s critical loop processes two
consecutive columns of the segment each time it is called. So it begins by reading columns
2*i and 2*i+1 from CM to the column buffer in LDM. It then places the symbols from
the buffer into the segment in LDM. The problem is solved by the manner in which the
LDM instructions are performed. First, two symbols belonging to the same row are read
from the two columns in the column buffer into temporary registers. These symbols will
be consecutive in the segment, and so they can be written to it by two parallel LDM
instructions. A pair of q symbols can be cleared in the same manner. Algorithm 18
specifies deint_column_4 by using this approach.
Algorithm 18: The algorithm shows how deint_column_4 should process two segment columns at a
time.
//The algorithm processes two consecutive segment columns 2*i and 2*i+1.
//Two CM instructions are used to read the non q symbols of the columns
//to the column buffer in LDM.
read the segment column a[0...n-1] from CM to the column buffer in LDM;
read the segment column b[0...n-1] from CM to the column buffer in LDM;
//The following loop places the symbols
//from the column buffer into the segment.
loop {
//Two symbols are read from the column buffer to two temporary registers.
//This is done by two parallel LDM instructions.
LDM instruction that reads a[j] from the column buffer to a0 |
LDM instruction that reads b[j] from the column buffer to a1;
//The two symbols are written from the temporary registers to the segment.
//This is also done by two parallel LDM instructions.
LDM instruction that writes a[j] from a0 to the segment |
LDM instruction that writes b[j] from a1 to the segment;
}
//The following loop clears the q symbols in the end of the columns.
loop {
LDM instruction that clears the q symbol
of row j and column 2*i in the segment |
LDM instruction that clears the q symbol
of row j and column 2*i+1 in the segment;
}
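The pairing idea of Algorithm 18 can be sketched in C. The layout parameters and function name are hypothetical; the two reads and the two consecutive writes per row correspond to the pairs of parallel LDM instructions in the algorithm.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of Algorithm 18 for 4-byte symbols: a[j] and b[j]
 * belong to adjacent columns of row j, so they occupy consecutive 32-bit
 * positions in the segment and can be written back-to-back -- the C
 * analogue of two parallel LDM writes. */
void place_two_columns_4b(uint32_t *segment, const uint32_t *a,
                          const uint32_t *b, size_t nr_rows,
                          size_t row_stride, size_t col_pair_off)
{
    for (size_t j = 0; j < nr_rows; j++) {
        uint32_t s0 = a[j];   /* two parallel LDM reads */
        uint32_t s1 = b[j];
        uint32_t *dst = segment + j * row_stride + col_pair_off;
        dst[0] = s0;          /* two parallel LDM writes to */
        dst[1] = s1;          /* consecutive positions      */
    }
}
```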
If the DSP’s hardware support is used for executing the shown loops then it is
guaranteed that they will meet the criterion for optimal performance.
On the other hand, the column buffer must be larger if two segment columns are to be
read from CM to LDM. This means the segments must be smaller. Specifically, each
segment can span over nr_rows rows, where nr_rows solves the following equation:
AVAIL_LDM = size_symbol * nr_rows * 2 + nr_rows * nr_cols
But the maximum size of a segment is reduced only marginally compared to the case
where deint_column_4 processes one column at a time (compare to Formula 9). This
change does not necessarily mean that more segments are required to cover the DSP’s
section; in fact, the number of segments can increase by at most one. The time lost in
handling an additional segment is minimal compared to what is gained by having optimal
critical loops.
2.4.4 Making the critical loops of deint_column_2 optimal
If deint_column_2 is to process only one segment column each time it is called, as
deint_column_12_2 does, then the criterion for optimal performance will not be met by its
critical loops. They would work as symb_loop_12_2 and q_symb_loop_12_2. The former
reads eight bytes from the column buffer by two parallel LDM instructions, but then uses
four LDM instructions that are parallel pair-wise to write them to the segment. The latter
only writes to LDM but it too uses four LDM instructions that are parallel pair-wise.
The problem is that LDM instructions that write to the segment are limited by the
symbol size. Each of them writes two bytes, while it is required by the criterion to write
four.
To address this, deint_column_2 can, like deint_column_4, also process two consecutive
segment columns each time it is called. But there is a slight difference in how this must
be done. The reader should be patient throughout the following argument until Algorithm
19 is reached.
Four symbols can be read from the column buffer by using two parallel LDM
instructions. Two of the symbols belong to one row, and the other two belong to the next.
So their positions in the segment in LDM are pairwise consecutive. The four of them can
therefore be written to the segment using two parallel instructions. The problem is that
during this procedure the four symbols reside in two temporary registers. When the symbols
have been read from the column buffer, each register contains two symbols of the same
column. When they are to be written to the segment, each register must contain two
symbols of the same row. The contents of two register parts must be swapped.
To do this and maintain an optimal loop is difficult due to reasons that are mentioned in
Paragraph 5.1. Specifically, the instruction that can be used to easily swap two register
parts can not be performed in parallel with two LDM instructions. Instead three
instructions and an extra temporary register are required for each pair of register parts
that are to be swapped. This requires the loop to process more than four symbols per
iteration. The following algorithm embodies the lengthy argument:
Algorithm 19: The algorithm shows in what manner deint_column_2’s critical loop should move
symbols from the column buffer to the segment.
//The algorithm processes two consecutive segment columns 2*i and 2*i+1.
//Two CM instructions are used to read the non q symbols of the columns
//to the column buffer in LDM.
read column s[0...n-1] of the segment from CM to LDM;
read column t[0...n-1] of the segment from CM to LDM;
//The temporary registers a0, a1, a2 and a3 will be used.
LDM instruction that reads s[0] and s[1] from the column buffer to a2 |
LDM instruction that reads t[0] and t[1] from the column buffer to a3;
//The following loop places the symbols
//from the column buffer into the segment.
//The variable i is only used for clarification.
//Eight symbols per iteration are processed.
//Instructions required to swap s[1] and t[0].
Instructions that begin swapping s[1] and t[0] between a2h and a3l;
//i = 2;
loop {
LDM instruction that reads s[i] and s[i+1] from the column buffer to a0 |
LDM instruction that reads t[i] and t[i+1] from the column buffer to a1 |
Instructions that finish swapping s[i-1] and t[i-2] between a2h and a3l;
//Instructions required to swap s[i-1] and t[i-2].
LDM instruction that writes s[i-2] and t[i-2] from a2 to the segment |
LDM instruction that writes s[i-1] and t[i-1] from a3 to the segment |
Instructions that begin swapping s[i+1] and t[i] between a0h and a1l;
//Instructions required to swap s[i+1] and t[i].
LDM instruction that reads s[i+2] and s[i+3] from the column buffer to a2 |
LDM instruction that reads t[i+2] and t[i+3] from the column buffer to a3 |
Instructions that finish swapping s[i+1] and t[i] between a0h and a1l;
//Instructions required to swap s[i+1] and t[i].
LDM instruction that writes s[i] and t[i] from a0 to the segment |
LDM instruction that writes s[i+1] and t[i+1] from a1 to the segment |
Instructions that begin swapping s[i+3] and t[i+2] between a2h and a3l;
//Instructions required to swap s[i+3] and t[i+2].
//i = i + 4;
}
//The loop stopped at i = j.
Instructions that finish swapping s[j-1] and t[j-2] between a2h and a3l;
//Instructions required to swap s[j-1] and t[j-2].
LDM instruction that writes s[j-2] and t[j-2] from a2 to the segment |
LDM instruction that writes s[j-1] and t[j-1] from a3 to the segment;
The unrolled loop shown meets the criterion for optimal performance. However, the
loop has been unrolled and must be followed by another one that places any remaining
symbols from the column buffer into the segment. This has been omitted. By processing
two segment columns each time, it is easy to construct a loop that clears four q symbols
per iteration and is also optimal. This has also been omitted.
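The halfword exchange performed by the swap instructions can be expressed in portable C, assuming (hypothetically) that a 32-bit temporary holds two 16-bit symbols of one column with the earlier symbol in the high half:

```c
#include <stdint.h>

/* From a2 = s[i]:s[i+1] and a3 = t[i]:t[i+1] (column order, earlier
 * symbol in the high half -- an assumption of this sketch), form the
 * row words s[i]:t[i] and s[i+1]:t[i+1] by exchanging a2's low half
 * with a3's high half. */
static inline void swap_halves(uint32_t a2, uint32_t a3,
                               uint32_t *row0, uint32_t *row1)
{
    *row0 = (a2 & 0xFFFF0000u) | (a3 >> 16);          /* s[i]   : t[i]   */
    *row1 = (a2 << 16)         | (a3 & 0x0000FFFFu);  /* s[i+1] : t[i+1] */
}
```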
2.4.5 Making the critical loops of deint_column_6 optimal
Earlier it was suggested that the function deint_column_12_6 can be modified to obtain
deint_column_6. Recall that the former processes only one column at a time. Its critical
loop symb_loop_12_6 does not meet the criterion for optimal performance, but in a sense it
is not far from achieving this. If the loop is changed so it becomes optimal then the
number of cycles it spends per symbol is reduced by only 14%. The improvement of the
implementation’s total execution time for this symbol size would be even smaller.
Besides the limited benefit, it is very hard, if not impossible, to improve
deint_column_6 the way deint_column_2 has been improved. The former could be improved
by processing two columns each time it is called, as the latter does. But Algorithm 19 shows how
words must be swapped between temporary register parts in deint_column_2’s critical
loop. If deint_column_6 is to process two segment columns at a time, then this task
becomes increasingly difficult to perform in its critical loop that places symbols from the
column buffer into the segment. There are too many pairs of register parts that must swap
contents and too few cycles to do so. All attempts to bring the loop to optimal
performance failed. The details are omitted. It is noteworthy though that with the
hardware changes suggested in Paragraph 5.1, it would be possible to bring the loop to
optimal performance.
However, if deint_column_6 is to process two segment columns, then it is easy to
construct a loop that clears four q symbols per iteration and is optimal. Nevertheless, if
deint_column_2 and deint_column_4 are to process two segment columns, then
deint_column_6 should do so too; after all, the idea requires that the overall
implementation be restructured. However, one of this function’s critical loops does not
need to be optimal.
Appendix 3. Rate dematching

This appendix is devoted to specifying the processing algorithm that was used as a base
for the implementation of Rate dematching, and to explaining how the implementation
was developed.
It is Paragraph 3.1 that specifies the processing algorithm. Paragraph 3.2 describes the
test cases that were used for verifying correctness and performance while writing the
implementation.
Paragraph 3.3 explains what steps were taken to optimize the implementation. The
improvements are described step by step, and their effect on performance is measured
with the test cases. The implementation’s final performance is also presented.
The appendix is concluded by Paragraph 3.4 that explains how the implementation
should be improved in the future.
3.1 Specification of the processing algorithm

The algorithm is organized around four structures. There are functions that extract data
from the structures or operate on them. The structures are rd_data, input_reader,
column_traverser and byte_array. They will be described in the following paragraphs.
They are used by the functions dematch, soft_comb_matrix and soft_comb_col, which are
specified after the structures. The dispatch procedure of the JOB will be omitted.
3.1.1 Description of byte_array
byte_array is used to (i) transfer byte arrays between CM and LDM and to (ii) read or
write to single bytes in a byte array in LDM. The first task is complicated due to the
reasons stated in Chapter 7, Paragraph 7.2.3, while the second task is complicated
because a single byte in LDM can not be addressed when writing C code. The structure
addresses these two issues.
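As an illustration of the bit arithmetic this involves, the following C sketch models byte-level access on a word-addressed memory. The array ldm and the names get_byte and set_byte are invented here for illustration; they are not the structure’s actual interface.

```c
#include <stdint.h>

/* Hypothetical sketch of byte-level access on a word-addressed memory.
 * ldm[] models LDM as an array of 32-bit words; byte i lives in word
 * i/4 at byte position i%4. */
static uint8_t get_byte(const uint32_t *ldm, unsigned i) {
    unsigned shift = (i & 3u) * 8u;          /* byte position inside the word */
    return (uint8_t)(ldm[i >> 2] >> shift);  /* word index = i / 4 */
}

static void set_byte(uint32_t *ldm, unsigned i, uint8_t b) {
    unsigned shift = (i & 3u) * 8u;
    uint32_t word = ldm[i >> 2];
    word &= ~(0xFFu << shift);               /* clear the target byte */
    word |= (uint32_t)b << shift;            /* insert the new byte */
    ldm[i >> 2] = word;
}
```

Note that every byte write becomes a read-modify-write of the containing word, which is part of why compiled byte accesses are expensive on this architecture.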
3.1.2 Description of rd_data
rd_data provides necessary data. A few examples are listed here:
- Basic information such as the matrix’s number of rows, the length of E, and so on.
- The function P and its inverse.
- The function Q and its inverse.
- The number of NULL bytes in a column i before permutation.
- The number of NULL bytes in a column i after permutation.
Some of the values provided by the structure are obtained from calculations that have
been performed in advance and stored in LDM. Producing them thus requires only a
small number of lookups in word arrays.
3.1.3 Description of input_reader
input_reader is used to traverse E in CM and to read necessary parts of it to LDM. It
makes sure that only the repetitions of the DSP’s matrix are read. Each repetition is read
in as large parts as LDM allows, but no column repetition is read partially. This structure
uses byte_array.
3.1.4 Description of column_traverser
column_traverser traverses a part of a repetition of the DSP’s matrix that has been read
from E to LDM. It traverses the part one column repetition at a time, and for each of them
it tells what matrix column it is to be soft combined with. It makes use of byte_array.
3.1.5 Specification of dematch
This function is executed by each job DSP in order to perform the processing step for its
matrix. It expects that the DSP’s matrix has been placed in LDM when it begins. The
function clear_matrix is used to clear the entire matrix if required. It will not be further
specified. The function soft_comb_matrix is used to perform the soft combination.
The function’s input is as follows:
- LDM_matrix, a pointer to the matrix in LDM.
- CM_input_E, a pointer to the array E in CM.
- The integers D, S, N, T and the boolean CLEAR that are among the processing step’s input parameters (see Chapter 5, Paragraph 5.3.2).
It works as the following algorithm shows.
Algorithm 20: The algorithm specifies the function.
rd_data data;
//Initiate the structure data.
init_data (data, D, S, N, T);
input_reader reader;
//Initiate the structure reader.
init_reader (reader, data, CM_input_E);
if (CLEAR == TRUE) {
    clear_matrix (LDM_matrix, data);
}
//Perform the soft combination.
soft_comb_matrix (LDM_matrix, reader, data);
return;
3.1.6 Specification of soft_comb_matrix
This function performs the actual soft combination of the matrix column by column.
Every repetition of the matrix is read to LDM by using input_reader. A repetition in
LDM is traversed by a column_traverser. Each column repetition is then soft combined
to the matrix by the function soft_comb_col. The latter will not be further specified.
Its input is as follows:
- LDM_matrix, a pointer to the matrix in LDM.
- reader, the input_reader that is used to read repetitions in E from CM.
- data, the rd_data structure that provides data.
The function works as follows:
Algorithm 21: The algorithm specifies the function.
column_traverser traverser;
//Initiate the structure column_traverser.
init_traverser (traverser, data);
byte_array buf_array;
//Initiate buf_array that will be used to read repetitions to LDM.
init_array (buf_array);
byte_array col_array;
//Initiate col_array that will be used by column_traverser
//to point out the beginning of the current column repetition.
init_array (col_array);
//Iterate as long as there are more repetitions of the matrix in E in CM.
while (has_more_input (reader)) {
    //Read as much as possible of the current repetition from CM to LDM.
    //Use buf_array for this task.
    read_repetition_to_ldm (buf_array, reader);
    //Retrieve the index of the first column repetition
    //that has been read to LDM.
    i = first_col_id (reader);
    //Initiate the traverser to the first column repetition that has been read.
    start_traverser (traverser, buf_array, i);
    //Iterate as long as there are more column repetitions
    //that have been read to LDM.
    while (has_more_columns (traverser)) {
        //Point out the beginning of the current
        //column repetition by using col_array.
        point_to_curr_column (col_array, traverser);
        //Retrieve the index of the matrix column that
        //it is to be soft combined with.
        i = curr_col_id (traverser);
        //Perform the soft combination between the column and column repetition.
        soft_comb_col (LDM_matrix, col_array, i);
        //Move the column_traverser one column repetition forward.
        move_forward_one_col (traverser);
    }
    //Move forward the input_reader to the next part of the current repetition
    //or to the beginning of the next repetition.
    move_forward (reader);
}
return;
3.2 Test cases used for writing the implementation

This paragraph specifies the test cases that were used while writing an implementation
of the processing algorithm that has been suggested. There are tests for verifying
correctness and tests that show how performance changes due to optimizations. The
execution times of the tests that were used for measuring the implementation’s
performance can not be revealed. Throughout this appendix some tables discuss the
implementation’s performance by using these tests. In these tables the tests’ execution
times have been normalized: every execution time in a table has been divided by the
shortest execution time in the same table.
3.2.1 Test cases for testing correctness
More than 300 test cases were used to verify the correctness of the implementation,
approximately 100 for each matrix U, V and W. The large number of tests was motivated
by how error-prone an implementation of the processing step is on EMPA, and more tests
were introduced when the implementation was optimized in assembler. Only a few test
cases in each series use very large input. The parameters that have been varied over the
tests are:
- The size of the matrix.
- The number of NULL bytes.
- Whether the matrix is to be cleared or not.
- The length of the byte array E.
- The number of initial column repetitions skipped in E.
- The size of the buffer in LDM used to read the matrix’s repetitions in E from CM.
The input arrays of each test case were generated randomly. Each test specifies an
output that the implementation is expected to produce, given the input. The output of the
test has been generated using a separate implementation. This implementation is simple
and follows the processing step’s specification step by step. Performance was not taken
into consideration while writing it in order to make it easier to verify its correctness.
3.2.2 Test cases for measuring performance
The following test cases were used to measure the performance of the implementation.
In each test the execution time of the function dematch is measured in number of cycles.
Table 22: Test cases used for measuring performance.
Test case’s name     Rows in matrix   Length of E (bytes)   Cleared before soft combination   LDM buffer size (bytes)
matrix_cleared_U     193              18528                 TRUE                              6148
matrix_cleared_V     193              18528                 TRUE                              6148
matrix_cleared_W     193              18528                 TRUE                              6148
matrix_U             193              18528                 FALSE                             6148
matrix_V             193              18528                 FALSE                             6148
matrix_W             193              18528                 FALSE                             6148
overhead_W           1                96                    FALSE                             6148
There are no NULL bytes in the matrices. The first repetition in E belongs to U in every
test, and no column repetitions have been skipped (see Chapter 7, Paragraph 7.1.3).
Compare the length of E to the number of rows in the matrix in each test. There is
exactly one repetition of the matrix in E. The first six tests use large input arrays. The
matrix is to be cleared in the first three. Note that the buffer in LDM is not large enough
to contain the matrix’s entire repetition in any test; each repetition must therefore be read
in two or three parts.
It is interesting to know how much of a test’s execution time is spent on tasks other
than (i) clearing the matrix, (ii) executing CM instructions that read E and (iii) executing
the function soft_comb_col. What remains besides these tasks is, among other things,
initiating and managing the structures that the implementation uses (see Paragraph 3.1).
32 columns are processed in every test case, but in overhead_W each of them has only one
row. This test spends a minimal amount of time performing the three mentioned tasks and
can therefore be used to estimate the overhead for the other tests.
The function SatFunc that is used to perform soft combination requires that the sum is
adjusted if it is outside of a certain bound (see Chapter 5, Paragraph 5.3.2). In each test
every byte of the matrix and E has the value 127. This is to ensure that the sum is always
out of bounds, and the implementation is thus forced to make adjustments by each soft
combination.
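The saturating adjustment just described can be modelled in C as follows. sat_add is an illustrative name and the bound [-128, 127] is taken from the description above; the exact SatFunc definition is given in Chapter 5, Paragraph 5.3.2.

```c
/* Sketch of a saturating byte addition: the sum of two signed 8-bit
 * soft values is clamped to [-128, 127]. */
static signed char sat_add(signed char a, signed char b) {
    int sum = (int)a + (int)b;       /* compute in full precision */
    if (sum > 127)  sum = 127;       /* adjust to the closest boundary */
    if (sum < -128) sum = -128;
    return (signed char)sum;
}
```

With every byte of the matrix and E set to 127, every sum is 254 and is always clamped, which is exactly what the performance tests exploit.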
3.3 Implementation of the processing algorithm

The algorithm that has been described in Paragraph 3.1 was implemented in C
code and then further optimized. Paragraph 3.3.1 presents its initial performance and
describes the major optimization steps and their impact on performance. This includes
assembler optimizations, but they are further described in Paragraph 3.3.2. The
implementation’s performance after all the optimizations is discussed in Paragraph 3.3.3.
3.3.1 Major optimization steps performed
The first version of the implementation performed as follows.
Table 23: The performance of the first version of the implementation is presented. The test case’s execution
times are shown, but the values have been normalized to the smallest one.
Test case Execution time
matrix_cleared_U 23.80
matrix_cleared_V 24.18
matrix_cleared_W 23.85
matrix_U 22.83
matrix_V 23.21
matrix_W 22.89
overhead_W 1
The tests that need to clear the matrix require more time. It might be surprising that
these three tests require nearly equal time. The same observation can be made for
matrix_U, matrix_V and matrix_W. The repetitions of V and W have after all been
interlaced in E, so one might expect their tests to require considerably more time to
execute. The reason is that bytes in LDM can not be addressed when writing C code, and
in the function soft_comb_col bit arithmetic is used to obtain a word’s byte. It therefore
does not make much of a difference which matrix is being processed.
Chapter 7, Paragraph 7.2.1 suggests how the implementation can be improved for
matrices that are to be cleared. When the matrix has been cleared, its first repetition in E
can be used to fill it instead of performing soft combination. The implementation was
improved in this way by introducing the functions fill_matrix and fill_col. The former
reads the first 32 occurrences of the matrix’s column repetitions from E and inserts them
into the matrix. Each column is inserted by the latter function. fill_matrix will not be
further specified for it works almost exactly as soft_comb_matrix shown in Algorithm 21.
The only two differences are that fill_matrix’s inner loop is restricted to iterate no more
than 32 times, and that it calls fill_col instead of soft_comb_col. For clarification, the
following algorithm shows how the function dematch was modified:
Algorithm 22: The algorithm specifies dematch after the optimization.
rd_data data;
init_data (data, D, S, N, T);
input_reader reader;
init_reader (reader, data, CM_input_E);
if (CLEAR == TRUE) {
    clear_matrix (LDM_matrix, data);
    //The first 32 occurrences of the matrix’s column repetitions in E are read
    //and inserted into the matrix without soft combination.
    fill_matrix (LDM_matrix, reader, data);
}
//The remaining relevant column repetitions in E are read and soft combined.
//Note how fill_matrix and soft_comb_matrix use reader,
//which keeps track of what column repetitions have already been processed.
soft_comb_matrix (LDM_matrix, reader, data);
return;
The performance was improved as the following table shows. As expected those tests
where the matrix is cleared execute faster.
Table 24: The table shows how the optimization changed the performance. The test case’s execution times
are shown, but the values have been normalized to the smallest one.
Test case            Before optimization   After optimization   New time vs. before
matrix_cleared_U     23.80                 16.89                71%
matrix_cleared_V     24.18                 17.45                72%
matrix_cleared_W     23.85                 16.61                70%
matrix_U             22.83                 22.83                100%
matrix_V             23.21                 23.22                100%
matrix_W             22.89                 22.89                100%
overhead_W           1.00                  1.00                 100%
The assembler code produced by EMC was inefficient since in C code an LDM byte can
not be addressed. However, the assembler code had many other deficiencies due to
EMC’s quality. This is further discussed in Appendix 4. Because of this the functions
clear_matrix, fill_col and soft_comb_col were fully rewritten in assembler to optimize
the implementation further. Paragraph 3.3.2 describes the assembler optimizations. The
execution time improved as the following table shows. Pay attention to the improvement
for overhead_W. It has only one row in each column yet its execution time has been
reduced by nearly one seventh due to the optimization. This gives an idea of how
inefficient the assembler code was prior to the optimization. Now it also becomes
obvious that processing V or W requires more effort than U, no matter if the matrix is to
be cleared or not. Processing W requires notably more time than V due to the
difference between Permutation I and II.
Table 25: The table shows how the optimization changed the performance.
Test case            Before optimization   After optimization   New time vs. before
matrix_cleared_U     19.27                 1.47                 7.6%
matrix_cleared_V     19.91                 1.76                 8.8%
matrix_cleared_W     18.94                 1.78                 9.4%
matrix_U             26.05                 1.88                 7.2%
matrix_V             26.48                 2.21                 8.3%
matrix_W             26.11                 2.24                 8.6%
overhead_W           1.14                  1.00                 88%
But what is most alarming is the comparison of overhead_W’s execution time with the
others’. It shows that the time spent on tasks other than (i) clearing the matrix, (ii)
executing CM instructions that read E and executing the functions (iii) soft_comb_col or
(iv) fill_col now makes up a significant share of any test’s execution time. All other
tasks besides
the four mentioned are being performed by parts of the implementation that are still
written in C code and compiled by EMC. This suggests that improving those parts can
give a significant boost in performance, and this was done as a last optimization. The
execution times improved as follows.
Table 26: The table shows how the optimization changed the performance.
Test case            Before optimization   After optimization   New time vs. before
matrix_cleared_U     2.38                  1.80                 75%
matrix_cleared_V     2.85                  2.22                 78%
matrix_cleared_W     2.89                  2.23                 77%
matrix_U             3.06                  2.50                 82%
matrix_V             3.59                  2.96                 83%
matrix_W             3.63                  2.99                 82%
overhead_W           1.62                  1.00                 62%
3.3.2 Description of assembler optimizations
While optimizing the implementation it was decided that the tasks of the functions
clear_matrix, fill_col and soft_comb_col should be optimized by rewriting them in
assembler code. clear_matrix has only one loop, and improving it to meet the criterion
for optimal performance was easy.
However this was not doable for the other two functions as they currently work. Each
function’s critical loop processes one column at a time and the matrix is in row by row
order in LDM. So an LDM instruction in the loop that reads or writes to the matrix
transfers only one byte. Also there may be conflicts in hardware resource usage when
parallel LDM instructions are executed. Thus the loop fails to uphold two points of the
criterion for optimal performance.
Having said that, the functions were still optimized in assembler. fill_col was replaced
by two functions, namely fill_col_u and fill_col_vw. The former is used if fill_matrix
processes U, while the latter is used if either V or W is being processed. In the same way,
soft_comb_col_u and soft_comb_col_vw replaced soft_comb_col. The four functions were
directly written in assembler code.
Each of the four functions has an unrolled loop that processes 8 rows of the column per
iteration. The DSP’s hardware support is used for executing the loops (see Chapter 2,
Paragraph 2.1.4). To perform soft combination a very handy instruction is used in the
loops of soft_comb_col_u and soft_comb_col_vw, namely add4. add4 a0,a1,a2 treats each
of the four lower bytes of every temporary register as an individual integer. Each byte of
a0 is in this way added to the corresponding byte of a1. If the sum is larger than 127 or
smaller than -128 then it is adjusted to the closest boundary. The result is then stored in
the corresponding byte of a2. In this way soft_comb_col_u soft combines four rows of the
matrix’s column per add4 instruction. soft_comb_col_vw however soft combines only two
rows per instruction due to how the columns of V and W have been interlaced in E.
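The described behaviour of add4 can be modelled in plain C as below. add4_model and to_s8 are names invented for this sketch; it mirrors the semantics stated above, not necessarily the DSP’s exact implementation.

```c
#include <stdint.h>

/* Interpret byte k of word w as a signed 8-bit value. */
static int to_s8(uint32_t w, int k) {
    int v = (int)((w >> (8 * k)) & 0xFFu);
    return v > 127 ? v - 256 : v;
}

/* Byte-wise saturating addition: each of the four bytes of a0 is added
 * to the corresponding byte of a1, the sum is clamped to [-128, 127],
 * and the result byte is placed in the returned word (modelling a2). */
static uint32_t add4_model(uint32_t a0, uint32_t a1) {
    uint32_t a2 = 0;
    for (int k = 0; k < 4; k++) {
        int s = to_s8(a0, k) + to_s8(a1, k);
        if (s > 127)  s = 127;       /* adjust to the closest boundary */
        if (s < -128) s = -128;
        a2 |= ((uint32_t)s & 0xFFu) << (8 * k);
    }
    return a2;
}
```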
The critical loops have the following performance. It is only natural that the loops of
soft_comb_col_u and soft_comb_col_vw have twice the execution time of those of
fill_col_u and fill_col_vw: the former must additionally read bytes from the matrix
and perform soft combination.
Table 27: The performance of each function’s unrolled loop is presented. The average number of cycles
spent per row of the matrix’s column is shown, but the values have been normalized to the smallest one.
Function name        Average cycles per row of the matrix’s column (unrolled loop)
fill_col_u           1
fill_col_vw          1.11
soft_comb_col_u      2
soft_comb_col_vw     2.22
3.3.3 The implementation’s performance after optimizations
The following table presents the number of cycles required to execute the function
dematch for every test case. Depending on what test is executed, either the critical loop of
fill_col or soft_comb_col is executed. The table also shows how large share of the
execution time is spent in the loop.
Table 28: The table presents the test cases’ execution times and shows how large share of each test’s time is
spent in the critical loop. The execution times have been normalized to the smallest one.
Test case            Execution time   Share of execution time spent in the critical loop
matrix_cleared_U     1.75             42%
matrix_cleared_V     2.15             38%
matrix_cleared_W     2.17             37%
matrix_U             2.45             60%
matrix_V             2.90             56%
matrix_W             2.92             56%
overhead_W           1                0.85%
It becomes obvious that in the first six tests a large share of the execution time is not
spent in the critical loop. Yet the matrices in the tests are already as large as they can be
in the processing step. This can not be blamed on the CM instructions that read the
matrix’s repetition from CM to LDM. Nor is it because of the execution of clear_matrix
for the three first tests. These tasks do not make for a significant amount of the tests’
execution times.
The observations being made are due to how the implementation was developed. A
modularized approach was used, using multiple structures that perform individual tasks
(see Paragraph 3.1). This decision was made since writing an implementation for the
processing step was deemed to be very error-prone. In this way each structure’s functions
that operate on it could be tested for correctness individually. This benefit was very
important due to the limited time of the thesis that could be dedicated to resolving all the
issues that surfaced while implementing the processing step.
However the disadvantages of using a modularized approach instead of writing a
“seamless” implementation are now significant. They can be made more obvious by a
simple observation. This will also clarify the results in the table. Algorithm 21
demonstrates the function soft_comb_matrix before the optimizations. After the
optimizations, the call to soft_comb_col has been replaced by a conditional statement
with two outcomes. It chooses between either soft_comb_col_u or soft_comb_col_vw.
fill_matrix is nearly identical, and it suffices to refer to Algorithm 21 throughout the
following argument.
Now, consider one iteration of the inner loop of fill_matrix for the test
matrix_cleared_U after the optimizations. 62% of the iteration’s execution time is spent
on executing fill_col_u. This means that 38% of the time is spent on the constant
amount of work required per iteration. The latter number would be larger if the matrix
had fewer rows.
The conclusion is that the problem is mainly twofold. First, the columns of the matrix
are too short. Second, the constant amount of work per iteration is too large. While
nothing can be done about the former since it has to do with the processing step’s
specification, something can be done about the latter.
But the approach used to write the implementation can not solely be blamed. The
constant amount of work required per iteration is performed by parts of the
implementation that are still written in C code, and as Appendix 4 shows EMC is not of
high quality.
3.4 Future improvements

The future improvements suggested in this paragraph are mentioned in a specific order.
It is believed that those mentioned first are of more practical use than the others, due to
the typical input parameters when executing Rate dematching (see Chapter 7,
Paragraph 7.1.6). It is first suggested how the constant amount of work required for each
column repetition can be reduced. Next it is suggested how the critical loops of the
functions fill_col_u, fill_col_vw, soft_comb_col_u and soft_comb_col_vw can be
brought closer to being optimal. Recall that each of them currently processes only one
column repetition at a time. This means that the LDM instructions that write to the
matrix’s column write only one byte. Also there may be conflicts in hardware resource
usage when executing parallel LDM instructions.
3.4.1 Bypassing the structures
Now that the implementation has been completed, it becomes more obvious what roles
the implementation’s structures fill and how these roles can be performed more swiftly by
bypassing the structures completely. To give an example, consider the inner loop of
soft_comb_matrix (see Algorithm 21). It processes one column repetition each time it is
executed and uses column_traverser to do so. The structure can be replaced by an
unsigned integer that points out the beginning of the current column repetition in the
buffer in LDM that stores repetitions of the matrix. This integer can be passed to
soft_comb_col_u or soft_comb_col_vw depending on what matrix is being processed. At
the end of the loop, it is calculated where the next column repetition begins. This can be
done by incrementing the unsigned integer. How much it is to be incremented by depends
on what matrix column the current column repetition has been soft combined with. This
can be precalculated and stored in a word array in LDM, thus requiring only a lookup.
The inner loop of fill_matrix can be improved in the exact same way. By simplifying
the inner loops to this degree, it becomes an easy task to further optimize them in
assembler and thus bypass EMC completely. This should significantly reduce the
constant amount of work required per column repetition.
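The simplified inner loop described above can be sketched as follows, under assumed names: buf is the LDM buffer holding the repetition, col_id[c] gives the matrix column that column repetition c is soft combined with, and step[i] is the precalculated word array holding the increment to the next column repetition. None of these names come from the thesis, and soft_comb_col_u/soft_comb_col_vw are represented by a single function pointer.

```c
#include <stdint.h>

typedef void (*comb_fn)(signed char *matrix, const signed char *col, unsigned col_id);

/* Sketch: the column_traverser structure is replaced by one unsigned
 * integer (offset) and one table lookup (step[i]) per iteration. */
static void soft_comb_buffer(signed char *matrix, const signed char *buf,
                             const unsigned *col_id, const unsigned *step,
                             unsigned n_cols, comb_fn soft_comb_col)
{
    unsigned offset = 0;  /* beginning of the current column repetition */
    for (unsigned c = 0; c < n_cols; c++) {
        unsigned i = col_id[c];                   /* matrix column to combine with */
        soft_comb_col(matrix, buf + offset, i);   /* soft combine one column repetition */
        offset += step[i];                        /* precalculated increment: one lookup */
    }
}
```

In the real loop the function pointer would be replaced by the conditional choice between soft_comb_col_u and soft_comb_col_vw; the point of the sketch is how small the per-iteration bookkeeping becomes.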
3.4.2 Processing consecutive columns to obtain optimality
Let us see if the critical loop of fill_col_u can be made optimal by modifying it. Recall
that it is used to directly insert the first 32 column repetitions into the matrix U when it
has been cleared. It is chosen for the sake of discussion because it does not need to
perform soft combination as in soft_comb_col_u, and its task is not as complicated as in
fill_col_vw due to interlaced column repetitions. Yet it will be shown that making the
chosen loop optimal is probably impossible, which implies that the case is the same for
the other functions. Also, the difficulties stated here are only a few of all those
encountered in the attempt.
The critical loop of fill_col_u needs to read bytes from column repetitions and insert
them into U’s columns. If the loop is to be optimal then every LDM instruction that
writes to the matrix must write four bytes to LDM. The bytes must of course be
consecutive in LDM. Since the matrix appears in LDM row by row the bytes span over
four consecutive columns. So the loop must process four column repetitions per iteration.
Yet every LDM instruction that reads from a column repetition must read four bytes.
Therefore the procedure sums up to (i) executing four LDM instructions that read a total
of 4*4 = 16 bytes from LDM into temporary registers, (ii) processing the bytes by
permuting them and then (iii) executing four LDM instructions that write them to LDM.
In what order this must be done is not relevant for this discussion. What must be
understood is that there are a total of eight LDM instructions, and they must be executed
in parallel pair-wise. This gives four pairs of parallel LDM instructions. Pairing the
instructions to achieve this is not as simple as it may sound, but this is not the
main problem. The criterion for optimal loop performance requires that all other
instructions are executed in parallel with LDM instructions. Thus the instructions that
reorder the 16 bytes into some temporary registers (before they are written to LDM) must
be executed in parallel with the four pairs of LDM instructions. This means that the extra
instructions must be arranged into four sets of parallel instructions, in such manner that
each set can be executed in parallel with two LDM instructions. This limit could not be
met by the author.19
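The reordering in step (ii) amounts to a 4x4 byte transpose: in[j] holds four consecutive bytes of column repetition j, and out[r] must hold one byte from each of the four columns, i.e. four consecutive bytes of matrix row r. The following plain C sketch shows the permutation itself, with invented names, ignoring the parallel scheduling constraints that make it hard on the DSP.

```c
#include <stdint.h>

/* 4x4 byte transpose: in[j] = four bytes of column j (rows 0..3),
 * out[r] = four bytes of row r (columns 0..3). */
static void transpose4x4(const uint32_t in[4], uint32_t out[4]) {
    for (int r = 0; r < 4; r++) {
        uint32_t w = 0;
        for (int j = 0; j < 4; j++) {
            uint32_t byte = (in[j] >> (8 * r)) & 0xFFu; /* row r of column j */
            w |= byte << (8 * j);                       /* column j of row r */
        }
        out[r] = w;
    }
}
```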
Of course the critical loops do not need to be optimal; the stated technique can instead
be used to improve their performance significantly. But the improvement
would be of limited practical use due to the reasons stated in Chapter 7, Paragraph 7.1.6.
Most of the time when the processing step is to be performed the first repetition in E
belongs to U but the first two column repetitions have been skipped. Also there is hardly
ever a second repetition of U. This means that there are no column repetitions to soft
combine with the columns P(0) = 0 and P(1) = 16 of the matrix. The situation is worse
for V and W. For each of them there is hardly ever more than one repetition in E, and in it
some of the last column repetitions are missing. Take V as an example and say that the
last four column repetitions are missing. This means that there are no column repetitions
to soft combine with the columns P(28) = 7, P(30) = 15, P(29) = 23 and P(31) = 31 of
the matrix.
The technique is still useful if only a few columns of every matrix are missing column
repetitions. But if a critical loop is to process four consecutive columns each time it is
executed, then what should it do with those that have no column repetitions? A simple
19 The author even assumed that there is a fictional instruction that can copy any byte of one temporary
register to any byte of another, and that two such instructions can be performed in parallel with two LDM
instructions.
solution is to have three separate word arrays in LDM that are cleared. They can be used
as substitutions for column repetitions that are missing. Now if some of the four columns
are missing column repetitions when the loop is to be executed, then the corresponding
address registers can instead point to these word arrays that have been set to zero on
every position. The result will still be correct when performing soft combination due to
the simple observation that SatFunc(i + 0) = i, for any byte i.
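The substitution can be sketched as follows, with invented names: a statically cleared array stands in for a missing column repetition, so the soft combination adds zero everywhere and leaves the matrix bytes unchanged. The thesis suggests three such arrays; one suffices for the sketch.

```c
#include <stddef.h>

#define COL_ROWS 193  /* number of rows used by the performance tests above */

/* Zero-filled stand-in for a missing column repetition. Static storage
 * is zero-initialized, so no explicit clearing is needed here. */
static const signed char zero_col[COL_ROWS];

/* Pick the source for a column: the real column repetition if present,
 * otherwise the cleared substitute. */
static const signed char *col_source(const signed char *repetition) {
    return repetition != NULL ? repetition : zero_col;
}
```

In the critical loop the corresponding address register would simply be pointed at the cleared array instead of at a column repetition.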
Appendix 4. Review of assembler code produced by EMC
During the thesis it became obvious that EMC produces assembler code with many
deficiencies for the C code that it compiles. This had an unacceptable impact on the
performance of the thesis’ implementations when they were still written solely in C; the
most obvious reason was that the critical loops were compiled by EMC. The
following paragraphs give examples of critical loops that were written in C code during
the thesis. The assembler code produced for each of them by EMC is presented and the
code’s performance is discussed. The final paragraph draws some important conclusions
regarding EMC.
4.1 First critical loop

The following C code shows a loop that moves data from one array in LDM to another.
Every array has two words in each position. This was a critical loop used in Channel
deinterleaver’s implementation.
Code 1: A critical loop written in C code.
while (current_src < src_limit) {
    LDM_dst_32[current_dst] = LDM_src_32[current_src];
    LDM_dst_32[current_dst + 10] = LDM_src_32[current_src + 1];
    LDM_dst_32[current_dst + 2*10] = LDM_src_32[current_src + 2];
    LDM_dst_32[current_dst + 3*10] = LDM_src_32[current_src + 3];
    LDM_dst_32[current_dst + 4*10] = LDM_src_32[current_src + 4];
    LDM_dst_32[current_dst + 5*10] = LDM_src_32[current_src + 5];
    current_dst += 6*10;
    current_src += 6;
}
In each iteration LDM_src_32 can be read using six LDM instructions. They can be
performed in parallel pair-wise. Likewise six LDM instructions are required to write to
LDM_dst_32 and they too can be performed in parallel. The LDM instructions can update
their address registers accordingly (see Chapter 2, Paragraph 2.1.3.3). However, the
following is the assembler code produced by EMC for the loop.
Code 2: The assembler code produced by EMC for the C code shown by Code 1.
.begin_iteration_label:
ld *r1++, a2 | mv 20, m0 | add 6, a1
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0 | mv 20, m0
ld *r1++, a2
st a2, *r0++m0
.loop_condition_label:
cmp a0, a1
brr .begin_iteration_label, .a1:lt
The assembler code will not be explained in detail but only enough to later show its
deficiencies. The instruction add 6,a1 increments the temporary register a1 by 6.
ld *r1++,a2 reads two words from LDM_src_32 to a2. st a2,*r0++m0 then writes the
words to LDM_dst_32. These two LDM instructions also modify the address registers r0
and r1 that point out the current positions in the respective arrays. The former increments
r1 by one, while the latter increments r0 by the value of the offset register m0. mv 20,m0
sets it to 20.
cmp a0,a1 compares the value of a0 and a1 and then sets some flags. The next
instruction uses one of these flags to determine if the loop should stop executing or
continue.
The following observations can be made:
- m0 is set to the same value six times, even though it is not changed in between.
- Besides the first set of parallel instructions, every other set contains only one meaningful instruction (which happens to be an LDM instruction).
The loop’s execution time can be reduced by 47% by modifying only the assembler code shown in Code 2. By modifying the code in this manner and also using the DSP’s hardware support for executing loops, the time can be reduced by 65% (see Chapter 2, Paragraph 2.1.4 for a description of this hardware support). This gives an idea of the impact EMC has on performance.
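To give a feel for what such a modification could look like, the following is an illustrative hand-scheduling of Code 2. It is a sketch only: the redundant mv 20,m0 is hoisted out of the loop, each load is paired with the store of the previously loaded word pair (a simple form of software pipelining), and the loop therefore reads one word pair past the end of the array, which must be harmless or guarded. The scheduling has not been verified against the DSP’s exact issue rules, and the thesis’ actual optimized code may differ.

```asm
	mv 20, m0                      ; set the offset register once, outside the loop
	ld *r1++, a2                   ; prologue: load the first word pair
.begin_iteration_label:
	ld *r1++, a3 | st a2, *r0++m0 | add 6, a1
	ld *r1++, a2 | st a3, *r0++m0
	ld *r1++, a3 | st a2, *r0++m0
	ld *r1++, a2 | st a3, *r0++m0
	ld *r1++, a3 | st a2, *r0++m0
	ld *r1++, a2 | st a3, *r0++m0
	cmp a0, a1
	brr .begin_iteration_label, .a1:lt
```

With the DSP’s hardware loop support, the cmp and brr at the bottom could be removed as well, leaving only the six paired transfer cycles per iteration.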
The C code shown in Code 1 was modified in an attempt to make it more obvious to EMC that LDM instructions can be performed in parallel. In each iteration, the six positions of LDM_src_32 were first read into temporary variables, and these variables were then written to the six positions of LDM_dst_32. In this way it should be obvious that six consecutive read instructions are performed, followed by six write instructions. Nevertheless, the execution time of the new loop produced by EMC was worse than before the modification.20
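The modified loop may have looked roughly as follows. This is a sketch with ordinary C arrays standing in for the LDM arrays; the stride of 10 is taken from Code 1, while the function wrapper, names, and loop bound are illustrative.

```c
#include <stdint.h>

#define STRIDE 10 /* destination stride, as in Code 1 */

/* Read six source positions into temporaries, then write them out,
   hoping the compiler can pair the resulting LDM instructions. */
void copy_with_stride(uint32_t *dst, const uint32_t *src, int n)
{
    int current_dst = 0, current_src = 0;
    while (current_src < n) {
        uint32_t t0 = src[current_src];
        uint32_t t1 = src[current_src + 1];
        uint32_t t2 = src[current_src + 2];
        uint32_t t3 = src[current_src + 3];
        uint32_t t4 = src[current_src + 4];
        uint32_t t5 = src[current_src + 5];
        dst[current_dst]            = t0;
        dst[current_dst + STRIDE]   = t1;
        dst[current_dst + 2*STRIDE] = t2;
        dst[current_dst + 3*STRIDE] = t3;
        dst[current_dst + 4*STRIDE] = t4;
        dst[current_dst + 5*STRIDE] = t5;
        current_dst += 6*STRIDE;
        current_src += 6;
    }
}
```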
4.2 Second critical loop
The following C code shows a loop that clears an array in LDM. The array has two words in each position. This was a critical loop used in Channel deinterleaver’s implementation.
Code 3: A critical loop written in C code.
while (counter < counter_limit) {
LDM_dst_32[current_dst] = 0;
LDM_dst_32[current_dst + 10] = 0;
LDM_dst_32[current_dst + 2*10] = 0;
LDM_dst_32[current_dst + 3*10] = 0;
current_dst += 4*10;
counter += 4;
}
In every iteration the array’s four positions can be cleared by four LDM instructions, which can be performed pair-wise in parallel. However, the following is the assembler code EMC produced for it.
Code 4: The assembler code produced by EMC for the C code shown in Code 3.
.loop_condition_label:
cmp a0, a1
brr .end_loop_label, .a1:ge
.begin_iteration_label:
mv 0, a2l | mv 20, m0 | add 4, a1
st a2, *r0++m0 | mv 0, a2l
mv 20, m0
st a2, *r0++m0 | mv 0, a2l
mv 20, m0
st a2, *r0++m0 | mv 0, a2l
mv 20, m0
st a2, *r0++m0
brr .loop_condition_label
.end_loop_label:
The instruction mv 0,a2l clears the whole of a2.21 This loop has the same deficiencies as the previous one. Note that no LDM instructions are performed in parallel. In addition, the following can be observed:
- a2 is set to the same value multiple times, even though it is not changed in between.
- Two branch instructions are used when only one is necessary to construct the loop.

20 In one attempt the six variables were declared with the keyword “register” in C to clarify to EMC that their values are only stored temporarily.
The loop’s execution time can be reduced by 57% by modifying only the assembler code shown in Code 4. By modifying the code in this manner and also using the DSP’s hardware support for executing loops, the time can be reduced by 85%. Again, the compiler has a significant impact on performance.
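Again for illustration only, the clearing loop in Code 4 could be rewritten along the following lines. In this sketch the constant moves to a2l and m0 are hoisted out of the loop, the stores are paired by letting a second address register r1 start one position stride (20 words) after r0 while both step two strides (40 words) per store, and the loop test is moved to the bottom, assuming at least one iteration. The scheduling has not been verified against the DSP’s exact issue rules.

```asm
	mv 0, a2l                      ; clear a2 once, outside the loop
	mv 40, m0                      ; step two position strides per store
	; r1 is assumed to start 20 words after r0
.begin_iteration_label:
	st a2, *r0++m0 | st a2, *r1++m0 | add 4, a1
	st a2, *r0++m0 | st a2, *r1++m0
	cmp a0, a1
	brr .begin_iteration_label, .a1:lt
```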
4.3 Conclusion
In light of these observations, the following conclusions can be drawn regarding the assembler code produced by EMC:
- the DSP’s very short instruction pipeline is used inefficiently (unnecessary branch instructions delay it),
- unnecessary instructions are performed multiple times, and
- the DSP’s capability to execute multiple instructions in parallel is used inefficiently.
In ELTE it is common practice to compile implementations with EMC without further improving the result in assembler, owing to a constant shortage of time. It is therefore crucial that EMC is improved so that it uses EMPA’s capabilities to a greater extent.
21 This happens even if the instruction only addresses a register part. Sign extension occurs over the entire register due to the standard configuration of the DSP.
Appendix 5. Hardware changes to EMPA
During the thesis, some hardware aspects of EMPA made it notably difficult to write efficient implementations. This chapter highlights exactly how they became an obstacle, and suggests hardware changes that would remove it. However, a further discussion of whether the changes are feasible from the perspective of most implementations written by ELTE is still necessary, and a deeper understanding of EMPA’s underlying architecture is required to judge whether the changes can be carried out at all. These discussions are outside the scope of the thesis and are left to the discretion of ELTE.
5.1 Swapping contents of temporary register parts
Chapter 4 presents the criterion for optimal loop performance, which states that every cycle of the loop must be used to transfer the maximum possible amount of data between temporary registers and LDM. This requires that two parallel LDM instructions are executed every cycle; other instructions may only be executed in parallel with them.
It is possible that the loop’s task includes permuting the words that have been read from LDM. This means words must be swapped between the temporary register parts that hold the data read from LDM (see Chapter 2, Paragraph 2.1.2.1 for a description of temporary registers). For this purpose the instruction mv a0l,a1h comes in handy. It copies the contents of a0l to a1h, and can do this for any pair of register parts. Such an instruction will be referred to as a move instruction. Thus mv a0l,a1h|mv a1h,a0l is two move instructions in parallel, and the result is that the contents of a0l and a1h are swapped.
The problem is that the total number of LDM instructions and move instructions executed in parallel cannot exceed two. An LDM instruction and a move instruction can thus be executed in parallel, but not even one move instruction can be used in parallel with two LDM instructions. The conclusion is that a loop cannot use move instructions to perform its designated task and still be optimal.
The fact that move instructions cannot be executed in conjunction with LDM instructions has nothing to do with which temporary registers each instruction uses, for they may be totally different. There are certain limits that must be met by any set of instructions that is to be executed in parallel, and two LDM instructions together with two move instructions meet all of them. [11] states that the Data Address Arithmetic Unit of the DSP can execute two instructions in parallel, and that this is the unit that executes LDM instructions. A cautious conclusion is that this unit also executes move instructions, but this was neither confirmed nor denied by [11]. If this is true, an interesting question is why a unit of the DSP that is mainly responsible for executing LDM instructions also executes a purely arithmetical instruction. The DSP has other computational units that are tasked with many arithmetical instructions. Can they not be tasked with executing move instructions too? For instance, the instruction copy a0,a1 copies the content of a0 to a1, and it can be executed in parallel with two LDM instructions.
Paragraph 2.4.4 presented a loop with this exact task: reading data from LDM, swapping the contents of register parts, and writing the data back to LDM. The loop is optimal, and for that reason not a single move instruction could be used. For each pair of register parts that had to swap contents, three instructions and an extra temporary register were required. Algorithm 19 presents the loop and shows how these complications were barely managed. In Paragraph 2.4.5 the problem was encountered again, but this time no solution was found: there were too many pairs of register parts that had to swap contents, so the same technique could not be used again.
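To illustrate why avoiding move instructions is costly, the following C sketch shows the kind of shift-and-mask sequence needed to swap the low half of one 32-bit register with the high half of another. It illustrates the principle only; the thesis’ actual loop used the DSP’s own instructions, and the helper function below is hypothetical.

```c
#include <stdint.h>

/* Swap the low 16 bits of *x with the high 16 bits of *y without a
   part-to-part move instruction: each new value is rebuilt with
   shifts and masks, using an extra temporary along the way. */
void swap_low_high(uint32_t *x, uint32_t *y)
{
    uint32_t t = *x;                       /* extra temporary register */
    *x = (t & 0xFFFF0000u) | (*y >> 16);   /* keep x's high half, take y's high half */
    *y = (*y & 0x0000FFFFu) | (t << 16);   /* keep y's low half, take x's low half */
}
```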
It is very likely that a critical loop is tasked with reading data from LDM, permuting the words that have been read, and then writing them back to LDM. To make it easier to write such a loop so that it meets the criterion for optimal performance, the DSP should be able to execute at least one move instruction in parallel with two LDM instructions. Two are recommended, so that constructions like mv a0l,a1h|mv a1h,a0l can be used to swap the contents of two register parts in a single cycle without requiring extra temporary registers.